
Approximations and Errors

(These notes are based mainly on the 6th Edition of the textbook "Numerical Methods for Engineers" by S.C. Chapra and R.P. Canale.)

Error Definitions

E_t = True error = True value − Approximate value

True relative error = true error / true value

The true percent relative error is given by ε_t, where

ε_t = (true error / true value) × 100%

The true percent absolute relative error is denoted by |ε_t|.

→ If the true (or exact) value or answer is not known (which is usually the case in real-world engineering problems), then one can use approximate-error definitions as follows:

ε_a = (approximate error / approximate value) × 100% = [ (present approximation − previous approximation) / present approximation ] × 100%

where ε_a is the approximate percent relative error.

→ The approximate percent absolute relative error is denoted by |ε_a|.

Error Types

→ Round-off errors arise from the fact that computers can only represent/store quantities with a finite number of digits. For π = 3.14159265358979…, the omission of the remaining digits by the computer is called round-off error.

→ Truncation errors result from applying numerical methods which employ approximations to represent exact mathematical operations and quantities. Example: true versus approximate derivatives.
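To make these definitions concrete, here is a hedged Matlab sketch (the function, the tolerance and the variable names are illustrative choices, in the spirit of Example 3.2 of the textbook): it estimates e^0.5 with a Maclaurin series and reports |ε_t| and |ε_a| after each added term.

trueVal = exp(0.5);  % true value (known here, so et can be computed)
est = 0; term = 1;   % running sum and current Maclaurin term of e^0.5
for n = 0:20
    prev = est;
    est = est + term;            % add the term 0.5^n/n!
    term = term*0.5/(n+1);       % prepare the next term
    et = abs((trueVal-est)/trueVal)*100;   % true percent relative error
    if n > 0
        ea = abs((est-prev)/est)*100;      % approximate percent relative error
        fprintf('n=%2d  est=%.8f  |et|=%.2e%%  |ea|=%.2e%%\n', n, est, et, ea)
        if ea < 0.5e-4, break, end         % illustrative stopping tolerance
    end
end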


Taylor Series

The Taylor series is an infinite power series; it predicts the value of a function at one point using the value of the function and its derivatives at another point, as follows:

f(x_{i+1}) = f(x_i) + f'(x_i)h + [f''(x_i)/2!]h^2 + [f^(3)(x_i)/3!]h^3 + ⋯ + [f^(n)(x_i)/n!]h^n + R_n

The nth-order Taylor series approximation of f(x_{i+1}) around (about) x_i is therefore given by:

f(x_{i+1}) ≅ f(x_i) + f'(x_i)h + [f''(x_i)/2!]h^2 + [f^(3)(x_i)/3!]h^3 + ⋯ + [f^(n)(x_i)/n!]h^n

The remainder term (the truncation error for the nth-order approximation) R_n is given by:

R_n = [f^(n+1)(ξ)/(n+1)!] h^(n+1),   where x_i < ξ < x_{i+1} and h = x_{i+1} − x_i (h is the step size)

Zero-order approximation: f(x_{i+1}) ≅ f(x_i)
First-order approximation: f(x_{i+1}) ≅ f(x_i) + f'(x_i)h
Second-order approximation: f(x_{i+1}) ≅ f(x_i) + f'(x_i)h + [f''(x_i)/2!]h^2

→ The nth-order Taylor series expansion is exact for an nth-order polynomial.

→ In general, the truncation error is decreased by including additional terms in the expansion, but we should also limit the step size. We can also decrease the truncation error if we reduce the step size sufficiently (note that when the step size is reduced, the term f^(n+1)(ξ) will change). We usually select the step size such that |h| < 1. For example, consider two successive terms in the Taylor series expansion, which involve the factors h^n and h^(n+1), respectively. If the step size is taken as 10^(−p) (where p is a positive integer), then the ratio h^n/h^(n+1) equals 10^p. This shows that if p increases, then the ratio h^n/h^(n+1) also increases, i.e. each successive term becomes smaller relative to the one before it.

Let's make a formal definition of the Taylor series as follows: If f(x) has derivatives of all orders at x = a (i.e. if f^(n)(a) exists for n = 0, 1, 2, ⋯), then the series

Σ_{n=0}^{∞} [f^(n)(a)/n!] (x − a)^n

is called the Taylor series of f about the number a. If a = 0, then we use the term Maclaurin series instead of Taylor series. The important question is: can we write f(x) = Σ_{n=0}^{∞} [f^(n)(a)/n!] (x − a)^n; in other words, does the Taylor series of f converge to f(x)? It is clear that the Taylor series of f converges to f(a) at x = a. However, we cannot directly say that the Taylor series of f (that is infinitely differentiable at x = a) converges to f(x) in an open interval containing x = a; the series may not converge anywhere except at x = a, and if it does converge elsewhere, it may converge to something other than f(x). If the Taylor series of f converges to f(x) in an open interval containing x = a, it is said that f is analytic at x = a.


Fig. 1.1 f(x): known function or exact solution; P_0(x): zero-order approx.; P_1(x): first-order approx.; P_2(x): second-order approx.
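In the spirit of Fig. 1.1, the short Matlab sketch below compares the zero-, first- and second-order Taylor predictions of f(x) = cos(x) from x_i = π/4 to x_{i+1} = π/3 (the function and the two points are illustrative choices, not taken from these notes):

f   = @(x) cos(x);  fd = @(x) -sin(x);  fdd = @(x) -cos(x);
xi  = pi/4;  xip1 = pi/3;  h = xip1 - xi;   % step size
P0  = f(xi);                 % zero-order prediction
P1  = P0 + fd(xi)*h;         % first-order prediction
P2  = P1 + fdd(xi)/2*h^2;    % second-order prediction
fprintf('true %.6f  P0 %.6f  P1 %.6f  P2 %.6f\n', f(xip1), P0, P1, P2)
% Each added term brings the prediction closer to the true value 0.5.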

Taylor Polynomials

Taylor polynomials are partial sums of the Taylor series of a function f about a given point x = a. We use Taylor polynomials to obtain approximations for a given function f. The nth-order Taylor polynomial of f about x = a is denoted by P_n and is given as follows.

P_n(x) = f(a) + f'(a)(x − a) + [f''(a)/2!](x − a)^2 + [f^(3)(a)/3!](x − a)^3 + ⋯ + [f^(n)(a)/n!](x − a)^n

Notice that P_n matches f and its first n derivatives at x = a, i.e.

P_n(a) = f(a),  P_n'(a) = f'(a),  ⋯  P_n^(n)(a) = f^(n)(a)

→ For example, the 2nd-order Taylor series approximation of a function f around the point x = x_i is equivalent to constructing the 2nd-order polynomial P_2(x) given below. In other words, P_2(x) is the 2nd-order Taylor polynomial of f around x = x_i.

P_2(x) = f(x_i) + f'(x_i)(x − x_i) + [f''(x_i)/2!](x − x_i)^2

Note that P_2(x_i) = f(x_i), P_2'(x_i) = f'(x_i), and P_2''(x_i) = f''(x_i).

→ Taylor's Theorem says that if f^(n+1) exists in an interval containing a and x, and if P_n(x) is the nth-order Taylor polynomial of f about x = a, then f(x) = P_n(x) + R_n(x), where

R_n(x) = [f^(n+1)(ξ)/(n+1)!] (x − a)^(n+1)   for some ξ between x and a


Computer Representation of Numbers

Significant Digits and Error Criteria

Significant digits (or figures) of a number are those that can be used with confidence (i.e. which we know for sure). The word "significant" means important, meaningful. Significance is a sign of precision. Computers (or machines) use rounding or simple chopping when dealing with numbers. Consider the numbers 1/6 = 0.1666666666… and 1/3 = 0.33333333…

1/6 ≅ 0.16667 (rounding applied), 1/6 ≅ 0.16666 (chopping applied)
1/3 ≅ 0.33333 (rounding applied), 1/3 ≅ 0.33333 (chopping applied)
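This behaviour can be reproduced with a short, hedged Matlab sketch (round(x,k,'significant') is a built-in call; the chopping expression is an illustrative construction, not a built-in):

k = 5; x = 1/6;
shift = k - 1 - floor(log10(abs(x)));  % digits to move past the decimal point
rounded = round(x, k, 'significant')   % gives 0.16667
chopped = fix(x*10^shift)/10^shift     % gives 0.16666 (extra digits discarded)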

Examples:

a) 0.01625, 0.001625, 0.0001625: all three numbers have 4 significant digits. The leading zeros locate the decimal point only. We can remove these leading zeros and still express the given numbers correctly; therefore the leading zeros are not significant, i.e. not important. Using scientific notation can help to see the number of significant digits (or figures). Rewrite the given numbers in scientific notation as follows:

1.625 × 10^(−2), 1.625 × 10^(−3), 1.625 × 10^(−4)

b) The trailing zeros may cause confusion. Consider the number 73600 (e.g. the population of a town: is it really and exactly 73600, or have you just rounded it, not being sure about the last two zeros?). So, are the last two zeros exact? Scientific notation can be used to indicate clearly the number of significant digits. Then, 7.3600 × 10^4, 7.360 × 10^4 and 7.36 × 10^4 have five, four and three significant digits, respectively.

c) 5001.35 has 6 significant digits. 63.18 has 4 significant digits. 63.1800 has 6 significant digits (you're sure that this number has exactly two trailing zeros). Note that 63.1800 is actually and exactly equal to 63.18, so you may still get confused about whether the last two zeros are significant or not.

d) The number 271828182845.9046 is typed into a hand (pocket) calculator which can only retain 10 significant decimal digits. The output is 2.718281828 × 10^11; it has the right order of magnitude but is correct only to the first 10 significant digits.

e) The number 12345678912345678912345, which has 23 digits, is typed into Matlab; when the enter key is pressed, the number is stored and displayed as 1.234567891234568e+022. The output has the right order of magnitude but is correct only to the first 16 significant digits, because by default Matlab stores numeric quantities using 16 significant decimal digits. Notice that the output can also be expressed as 12345678912345680000000.

f) Suppose you're asked to display 2 significant figures (or digits) when noting down the result of a calculation made by the computer. Then 0.01625 becomes 0.016, 0.001625 becomes 0.0016, and 0.0001625 becomes 0.00016, each correct to 2 significant figures.

g) Write down the result 63.0180057 using 3, 6 and 8 significant digits, applying rounding. Notice that zero digits in between non-zero digits are significant, i.e. important, because if we remove these zero digits the number is destroyed. 3 significant digits: 63.0 (the last 0 is significant); 6 significant digits: 63.0180 (the last 0 is significant); 8 significant digits: 63.018006.


h) Consider that 135.8 (in scientific notation 0.1358 × 10^3) is the true value of a quantity. The numbers 100, 140 and 136 are approximations for 135.8. Now, starting from the leftmost digit, apply 1- to 3-significant-digit approximations for 135.8 and employ rounding as follows: 100 is correct to 1 significant digit, in scientific notation 0.1 × 10^3. (Note: 104 can also be said to be correct to 1 significant digit.) 140 is correct to 2 significant digits, in scientific notation 0.14 × 10^3. (The zero which comes after the 4 is insignificant, i.e. not significant; it is not a digit obtained by the rounding itself but only locates the magnitude. 141 is also correct to 2 significant digits.) 136 is correct to 3 significant digits, in scientific notation 0.136 × 10^3.

→ Numerical analyses can involve iterative procedures, such as using a formula repeatedly until the result converges with regard to a stopping criterion. Computations can be repeated until |ε_a| < ε_s, where ε_s is the stopping criterion, prespecified tolerance or error criterion. Thus, the change in the result is monitored by calculating |ε_a|. It can be shown that when the criterion ε_s = (0.5 × 10^(2−n))% is met (i.e. when |ε_a| < ε_s), then the result is correct to at least n significant digits. This is a conservative criterion and it can yield results which are correct to more than n significant digits (see Example 3.2 in the textbook).

Decimal and Binary Numbers

As opposed to the decimal (base-10) system we use every day, numbers on computers are represented with a binary (base-2) system consisting of two digits, 0 and 1. Each binary digit (i.e. 0 or 1) is called a bit, and a byte is composed of 8 bits. The binary system is used in computers because the primary logic units of computers are electronic components that are either off (corresponding to 0) or on (corresponding to 1). The decimal integer 1992 is expressed as

(1992)_10 = 1 × 10^3 + 9 × 10^2 + 9 × 10^1 + 2 × 10^0

The decimal fraction 0.6432 is expressed as

(0.6432)_10 = 6 × 10^(−1) + 4 × 10^(−2) + 3 × 10^(−3) + 2 × 10^(−4)

See that (0.6432)_10 = 10^(−4) × (6 × 10^3 + 4 × 10^2 + 3 × 10^1 + 2). Therefore any decimal number is expressed as given below.

(a_n a_{n−1} … a_1 a_0 . f_1 f_2 f_3 f_4 …)_10 = a_n × 10^n + ⋯ + a_1 × 10^1 + a_0 × 10^0 + f_1 × 10^(−1) + f_2 × 10^(−2) + ⋯

Similarly, any binary number can be converted to the corresponding decimal number as given below.

(a_n a_{n−1} … a_1 a_0 . f_1 f_2 f_3 f_4 …)_2 = a_n × 2^n + ⋯ + a_1 × 2^1 + a_0 × 2^0 + f_1 × 2^(−1) + f_2 × 2^(−2) + ⋯
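The digit-weighted sum above can be sketched in Matlab as follows (illustrative variable names; the worked example just below confirms the value):

s = '1101.01';                        % binary string to convert
pt = find(s == '.');                  % position of the binary point
if isempty(pt), pt = numel(s) + 1; end
d = s(s ~= '.') - '0';                % digit values, with the point removed
powers = (pt-2):-1:(pt-1-numel(d));   % power of 2 carried by each digit
val = sum(d .* 2.^powers)             % displays 13.25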

For example, (1101.01)_2 = 1 × 2^3 + 1 × 2^2 + 1 × 2^0 + 1 × 2^(−2) = (13.25)_10. Converting decimal numbers to binary numbers is a more complicated procedure.

Floating-Point Representation

Fractional quantities are typically represented in computers using the floating-point form ±m · b^e, where m is the mantissa (or significand), b is the base of the number system and e is the exponent. For example, the number 156.78 can be represented as 0.15678 × 10^3 in a floating-point base-10 system. The mantissa can hold only a finite number of significant digits, and this leads to round-off errors. The mantissa is usually normalised if the leading digit (after the decimal point) is zero. The process of normalisation requires that (1/b) ≤ m < 1 (in the base-10 system, this is equivalent to 0.1 ≤ m < 1).


Fig. 1.2 Floating-point storage in computers

Normalisation provides a standard representation for each floating-point number. More significant digits can be retained by normalisation, and this helps to limit round-off errors. Consider the number 1/34 = 0.02941176471… Suppose we can only store 4 decimal digits in the mantissa. 1/34 can be expressed as 0.0294 × 10^0. However, when normalisation is applied, the number 1/34 is stored in the computer as 0.2941 × 10^(−1). Therefore, an additional significant digit is retained by removing the insignificant zero after the decimal point (in the number 0.0294 × 10^0). Similarly, any normalised binary floating-point number (other than zero) can be expressed as ±m · 2^e, where m = (0.f_1 f_2 f_3 …)_2 and f_1 = 1, hence (1/2) ≤ m < 1. Floating-point means the number of significant digits is kept fixed but the decimal point moves or floats as the value of the exponent is changed. In a fixed-point system, the numbers are represented with a fixed number of digits after the decimal; for instance, 39.405, 6.000 and 0.078 all have 3 digits after the decimal point. A floating-point system can represent a greater range of numbers than a fixed-point system; therefore fixed-point representations are not preferred, but they can be used to display numbers on the screen in fixed-point format (e.g. the %f conversion character in the fprintf command of Matlab).

In computers, real numbers are stored in a string (or sequence) of binary digits (bits). Figure 1.2 shows floating-point number representation (or storage) in computers. (The representation of integers is similar; please see the textbook.) In comparison to integer representation, floating-point numbers require more memory and take longer to process; however, they enable fractional quantities and large numbers to be represented in the computer. Floating-point numbers are stored in single or double precision format according to the IEEE 754 standard. (Before the IEEE 754 standard was established in 1985, each computer had its own floating-point number system.) In both single and double precision, the bit allocated for the sign of the number is 0 for positive numbers and 1 for negative numbers.

Single-precision format

The IEEE standard single-precision normalised floating-point number is expressed as

(−1)^s × 2^(c−127) × (1.f)_2

where s = 0 corresponds to the + sign and s = 1 corresponds to the − sign of the number. In single precision, 32 bits (4 bytes) are used to store the floating-point number (1 bit for the sign, 8 bits for the exponent and 23 bits for the mantissa), and this corresponds to about 7 significant decimal digits. The exponent (c − 127) is an integer and only the number c is stored in the 8 bits allocated for the exponent. Thus, the number c is restricted to be

0 < c < (1111 1111)_2 = 255

Therefore, 1 = (0000 0001)_2 ≤ c ≤ (1111 1110)_2 = 254, which means the exponent of the normalised floating-point number has the range −126 ≤ c − 127 ≤ 127. In the mantissa, the point between 1 and f is called the binary point.


The number c is an integer, since we can move the binary point only if we use integers as exponents when normalising numbers; what's more, integers are exactly representable in the binary system. In the mantissa of a nonzero number, the first bit is always 1, but this bit is not stored and is called the hidden bit. Therefore, the 23 bits allocated for the mantissa only store the number f. The mantissa of each nonzero number is restricted to be

1 ≤ (1.f)_2 ≤ (1.1111 1111 1111 1111 1111 111)_2 = 2 − 2^(−23)

The largest positive normalised floating-point number is (2 − 2^(−23)) × 2^127 ≅ 3.4028 × 10^38. The smallest positive normalised floating-point number is 2^(−126) ≅ 1.1755 × 10^(−38). Machine epsilon ε is defined as the distance from 1.0 to the next larger floating-point number. In single precision the next larger floating-point number after 1 is 1 + 2^(−23) = (1.00…001)_2, thus ε = 2^(−23) ≅ 1.2 × 10^(−7), and single precision corresponds to approximately 7 significant decimal digits of precision. In summary, a single-precision (32-bit) normalised floating-point number is written as a string of bits

b_1 b_2 b_3 … b_9 b_10 b_11 … b_31 b_32

and this floating-point number is equal to the real number given below.

(−1)^(b_1) × 2^((b_2 b_3 … b_9)_2 − 127) × (1.b_10 b_11 … b_31 b_32)_2

Although 2^(−126) is the smallest positive normalised floating-point number, we can store smaller unnormalised numbers, which are called subnormal numbers. In subnormal numbers, c = 0, i.e. the exponent field is composed of all zero bits, and the initial unstored bit is 0, not 1. Hence, single-precision positive subnormal numbers are represented by (0.b_10 b_11 … b_31 b_32)_2 × 2^(−126). For instance, 2^(−128) is stored in the computer using the following string of bits (2^(−128) = (0.01)_2 × 2^(−126)).

0 00000000 01000000000000000000000

The number zero is also a subnormal number, whose mantissa is composed of all zero bits, as shown below.

0 00000000 00000000000000000000000

The smallest positive number we can represent is 2^(−149) = 2^(−23) × 2^(−126) = (0.00…01)_2 × 2^(−126), and it is stored in the computer using the following string of bits.

0 00000000 00000000000000000000001

In Matlab, the function single converts numbers to single precision. For example, single(2^(-149)) equals 1.4012985e-45, whereas single(2^(-150)) and single(-2^(-150)) are both set to zero. Note that in Matlab, realmin('single') equals 1.1754944e-38. If a computation produces a result whose absolute value is greater than the largest positive normalised floating-point number, then overflow occurs and the result is set to ∞ (Inf) or −∞ (-Inf), depending on whether the result is positive or negative. If a computation gives a result that is undefined even in the real number system, the result is a value called Not-a-Number (NaN). 0/0, Inf-Inf and any arithmetic operation involving a NaN produce NaN in computers. If the result is set to ±∞, then c = 255, i.e. the bit string of the exponent field is composed of all ones, and the bits for the number f in the mantissa are all set to zero. If the result is set to NaN, then c = 255 and the bits for the number f are not all zero.

Example: Let's express the decimal number -52.234375 as a single-precision normalised floating-point number. It can be shown that (52.234375)_10 = (110100.001111)_2 = (1.10100001111)_2 × 2^5. The exponent is (5)_10, thus (c − 127) = 5, so the number c = 132 must be stored in the computer, and (132)_10 = (10000100)_2. Finally, the representation of -52.234375 is

1 10000100 10100001111000000000000
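These fields can be inspected directly in Matlab; the sketch below uses the built-in typecast, dec2bin and bin2dec functions (variable names are illustrative):

x = single(-52.234375);
bits = dec2bin(typecast(x,'uint32'), 32);  % the 32-bit pattern as text
s = bits(1)             % sign bit: '1' (negative)
c = bin2dec(bits(2:9))  % stored exponent c: 132
f = bits(10:32)         % 23-bit fraction field: 10100001111000000000000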


Double-precision format

Storing numbers in double-precision format is very similar to single-precision format. The IEEE standard double-precision normalised floating-point number is expressed as

(−1)^s × 2^(c−1023) × (1.f)_2 = ∓(1 + r) · 2^e

where s = 0 corresponds to the + sign, s = 1 corresponds to the − sign of the number, and 0 ≤ r < 1. In double precision, 64 bits (8 bytes) are used to store the floating-point number (1 bit for the sign, 11 bits for the exponent to store c, and 52 bits for the mantissa to store f), and this corresponds to about 16 significant decimal digits. Single-precision numbers therefore need less storage than double-precision numbers, but have less precision and a smaller range; when storing and doing arithmetic with floating-point numbers, double precision should be used to limit round-off errors. The number c is restricted to be

0 < c < (1111 1111 111)_2 = 2047

Then 1 ≤ c ≤ 2046, which means the exponent e of the normalised floating-point number has the range −1022 ≤ e = (c − 1023) ≤ 1023. Note that the exponent e is an integer. The mantissa of each nonzero number is restricted to be

1 ≤ (1.f)_2 ≤ (1.1111 1111 … 1111 1111)_2 = 2 − 2^(−52)

The largest positive normalised floating-point number is (2 − 2^(−52)) × 2^1023 ≅ 1.7977 × 10^308. The smallest positive normalised floating-point number is 2^(−1022) ≅ 2.2251 × 10^(−308). In double precision, the machine epsilon is 2^(−52) ≅ 2.2204 × 10^(−16); thus double precision corresponds to approximately 16 significant decimal digits of precision. By default, Matlab stores all numeric quantities as double-precision floating-point values. In Matlab, machine epsilon is given by eps or eps(1). (The function eps(x) gives the distance between x and the next larger floating-point number; eps(1)=eps=2.220446049250313e-016.) In Matlab, the smallest positive normalised double-precision floating-point number is given by realmin, and realmin=2^(-1022). The largest positive double-precision floating-point number is realmax, and realmax=(2-eps)*2^1023. Results or numbers greater than realmax or less than -realmax are assigned the values Inf and -Inf, respectively; hence in both cases an overflow error occurs. Inf is represented by taking e = 1024 and r = 0. NaN is represented by taking e = 1024 and r nonzero. If the absolute value of a number or result (in double precision) is smaller than realmin, then underflow occurs. However, many computers allow subnormal (or denormal) floating-point numbers in the interval between realmin and eps*realmin. Therefore, the smallest positive subnormal number is about 4.94e-324, and any result smaller than this is set to zero. (Examples: realmin*10^(-16)=0, -realmin*10^(-16)=0.) Subnormal numbers are represented by taking c = 0, i.e. e = −1023. In computers without subnormal floating-point numbers, any result or number whose absolute value is smaller than realmin is set to zero. Intel microprocessors can also use an internal extended-precision format in which 80 bits are used (1 bit for the sign, 15 bits for the exponent and 64 bits for the mantissa). This extended-precision format is used within the computer's hardware arithmetic unit and is not always available to the programmer.

Problem1: Convert the given double-precision normalised floating-point number into a decimal number. In double-precision format, a normalised floating-point number is expressed as (−1)^s × 2^(c−1023) × (1.f)_2. Remember that in double-precision format there is 1 bit for the sign, 11 bits for the exponent and 52 bits for the mantissa.

1 10000001010 1001001100000000000000000000000000000000000000000000

Solution1: The sign is (−); c = 1·2^10 + 1·2^3 + 1·2^1 = 1034, so c − 1023 = 11.
(1.f)_2 = 1 + 1·2^(−1) + 1·2^(−4) + 1·2^(−7) + 1·2^(−8) = 1.57421875
Then the given number is equal to −2^11 × (1.f)_2 = −3224.
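A hedged Matlab check of this result (typecast, bin2dec, bitshift and bitor are built-in functions; the 64-bit string is split into two 32-bit halves because bin2dec works reliably only up to about 52 bits):

bits = '1100000010101001001100000000000000000000000000000000000000000000'; % sign|c|f
hi = uint64(bin2dec(bits(1:32)));    % upper 32 bits
lo = uint64(bin2dec(bits(33:64)));   % lower 32 bits
x = typecast(bitor(bitshift(hi,32), lo), 'double')  % displays -3224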


Fig. 1.3 A floating-point number system (only positive numbers shown; an identical set extends in the opposite direction i.e. the negative range is symmetric)

Problem2: Consider a hypothetical binary computer that uses a quarter-precision format in which the normalised floating-point numbers are expressed as (−1)^s × 2^(c−3) × (1.f)_2. In this format, 8 bits are used: 1 bit for the sign, 3 bits to store the exponent c, and 4 bits to store f in the mantissa. Knowing that c is a nonnegative integer, show that the maximum value of c is 7. For the normalised floating-point numbers, this format requires that −2 ≤ c − 3 ≤ 3. Determine the largest and smallest positive normalised floating-point numbers and the machine epsilon.

Solution2: The largest value of the 3-bit exponent field is c_max = (111)_2 = 1×2^2 + 1×2^1 + 1×2^0 = 7.
The largest positive normalised floating-point number is 2^3 × (1.1111)_2 = 2^3 × (1×2^0 + 1×2^(−1) + 1×2^(−2) + 1×2^(−3) + 1×2^(−4)) = 15.5.
The smallest positive normalised floating-point number is 2^(−2) × (1.0000)_2 = 0.25.
The machine epsilon eps satisfies 1 + eps = (1.0001)_2, hence eps = 2^(−4) = 0.0625.

Floating-Point Representation and Round-off Error

A real number is called a floating-point number or machine number if it can be represented exactly in the floating-point number system of the computer. Most real numbers are not floating-point numbers, as there are only finitely many floating-point numbers. Therefore, computers apply approximations (in the form of rounding or chopping) to match real numbers with floating-point numbers. The errors introduced by these approximations are also called quantizing errors. The use of floating-point representation has some consequences related to round-off error, as depicted in Figure 1.3 and as given in the list below. (See the textbook for further details.)

→ The interval between floating-point numbers increases as the numbers grow in magnitude. The next larger number is determined by the length of the mantissa. The distances between single-precision numbers are larger than the distances between double-precision numbers.

→ Very small and very large numbers cannot be represented in computers, leading to underflow and overflow errors, respectively. When an underflow occurs, computers return zero. The subnormal numbers fill in the hole between zero and the smallest positive floating-point number. Matlab returns Inf or -Inf when a result is too large and cannot be represented, hence causing overflow. An overflow error can cause the computations to terminate.

→ Only a finite number of quantities can be represented in computers. In other words, most real numbers cannot be represented exactly in computers. The precision is limited by the number of significant digits allowed in the mantissa. Irrational numbers (such as π and √2) cannot be exactly represented. What's more, since computers employ the binary (base-2) system, they cannot represent some base-10 rational numbers, such as 0.1.

Using exactly representable numbers enables us to make calculations that incur no round-off error. Every integer, even when defined as a floating-point number, is exactly representable. By default, Matlab uses the floating-point system to express integers. The term "flint" is sometimes used to indicate a floating-point number whose value is an integer. Therefore, floating-point operations (addition, subtraction, multiplication) on flints do not introduce any round-off error unless the results are too large. For example, we can calculate factorials using floating-point numbers without any round-off error. Divisions and square roots involving flints also produce a flint if the result is an integer. For instance, sqrt(363/3) gives 11 with no round-off error.

Fig. 1.4 Rounding and chopping operations

Consider the floating-point number system given in Figure 1.4. In chopping, any real number falling within an interval of length Δx is stored as the floating-point number at the lower end of the interval, which means the upper error bound for chopping is Δx. In rounding, any real number falling within an interval of length Δx is stored as the nearest allowable floating-point number; therefore the upper error bound for rounding is Δx/2. Rounding produces lower error in comparison to chopping. A real number x cannot be exactly represented by a floating-point number
i) if it is too large or too small, or
ii) if the mantissa of x requires more bits than the floating-point format can accommodate.
To handle the problem given in case (ii), we should approximate x by using the closest floating-point number available in the computer. Suppose x is positive and not a floating-point number; then

x = b_0.b_1 b_2 … b_{p−1} b_p b_{p+1} b_{p+2} … × 2^e

where b_0 = 1 if x is normalised and b_0 = 0 if x is in the subnormal category; p = 23 for the single-precision format and p = 52 for the double-precision format. The closest floating-point number less than x is denoted by x_− and is obtained by discarding the bits b_{p+1} b_{p+2} …, as given below.

x_− = b_0.b_1 b_2 … b_{p−1} b_p × 2^e

The closest floating-point number greater than x is denoted by x_+, which must be written as follows.

x_+ = [ b_0.b_1 b_2 … b_{p−1} b_p + (0.00…01)_2 ] × 2^e

If x is negative, the definitions of x_− and x_+ are reversed: x_+ is obtained by discarding the bits b_{p+1} b_{p+2} …. We use the term fl(x) to indicate the floating-point approximation of x. Thus, if x is a floating-point number, then fl(x) = x. IEEE defines four types of fl(x).
1) Round down: fl(x) = x_−
2) Round up: fl(x) = x_+
3) Round towards zero: fl(x) = x_− if x > 0 and fl(x) = x_+ if x < 0. This process is also called chopping.


4) Round to nearest: fl(x) is either x_− or x_+, whichever is nearer to x. If x_− and x_+ are equally close to x, the one whose mantissa ends with a zero is chosen. The default rounding type in IEEE arithmetic is round to nearest, which is also simply called rounding.

The absolute rounding error is defined as E_abs = |fl(x) − x|. In chopping, E_abs < |x_+ − x_−| = 2^(−p) × 2^e, where the machine epsilon is ε = 2^(−p). In rounding (i.e. round to nearest), E_abs is no more than half the gap between x_− and x_+; thus E_abs ≤ 2^(−(p+1)) × 2^e = (ε/2) × 2^e. Notice that the upper bound for E_abs is proportional to the magnitude of the number being represented. The relative rounding error is defined as E_rel = |fl(x) − x|/|x|. Assume that x is expressed as a normalised number, as given below.

x = 1.b_1 b_2 … b_p b_{p+1} … × 2^e

Then |x| ≥ 2^e. In chopping, since |fl(x) − x| < ε × 2^e, we get E_rel = |fl(x) − x|/|x| < ε × 2^e/2^e = ε. Similarly, in rounding (i.e. round to nearest), E_rel = |fl(x) − x|/|x| ≤ ε/2.

Rounding and Chopping in a Hypothetical Decimal Computer

Suppose that machine numbers are represented in normalised decimal floating-point form in a decimal computer which can store only k decimal digits in the mantissa, as given below, where n is an integer.

∓0.d_1 d_2 … d_{k−1} d_k × 10^n

In the above representation, 1 ≤ d_1 ≤ 9 and 0 ≤ d_i ≤ 9 for i = 2, …, k. Consider a positive real number x expressed in normalised form, as given below.

x = 0.d_1 d_2 … d_k d_{k+1} d_{k+2} … × 10^n

Then fl(x) is obtained by ending the mantissa of x at k decimal digits, using either chopping or rounding. In chopping, we discard the digits d_{k+1} d_{k+2} … to obtain fl(x) = 0.d_1 d_2 … d_k × 10^n. In rounding, if d_{k+1} ≥ 5, we simply add 1 to d_k after we end the mantissa at k digits; then we obtain

fl(x) = [ (0.d_1 d_2 … d_k) + (0.00…01) ] × 10^n

In rounding, if d_{k+1} < 5, we get fl(x) = 0.d_1 d_2 … d_k × 10^n.

Computer Arithmetics and Associated Problems due to Round-off Error

In computer arithmetic, the problem is the loss of significance (or precision) as a result of round-off errors. For simplicity, arithmetic operations will be performed using base-10 numbers; arithmetic operations in other number base systems are performed in a similar manner. The normalised decimal floating-point representation of a real number x is denoted by fl(x); in other words, fl(x) is obtained by first normalising x and then terminating the mantissa using either rounding or chopping. The symbols ⊕, ⊖, ⊗, ⊘ indicate machine (computer) addition, subtraction, multiplication and division, respectively. Then computer arithmetic operations can be defined as:

u ⊕ w = fl( fl(u) + fl(w) ),   u ⊖ w = fl( fl(u) − fl(w) )
u ⊗ w = fl( fl(u) × fl(w) ),   u ⊘ w = fl( fl(u) ÷ fl(w) )
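As a hedged sketch of such a hypothetical k-digit decimal machine, the helper below (fl10 is an illustrative name, not a built-in) normalises x and terminates the mantissa by rounding or chopping; u ⊕ w then becomes fl10(fl10(u,k,mode)+fl10(w,k,mode),k,mode).

function y = fl10(x, k, mode)
% k-digit decimal floating-point approximation fl(x); mode is 'round' or 'chop'.
if x == 0, y = 0; return, end
e = floor(log10(abs(x))) + 1;   % exponent n such that the mantissa m is in [0.1,1)
m = x/10^e;                     % normalised mantissa
if strcmp(mode,'round')
    y = round(m*10^k)/10^k * 10^e;
else
    y = fix(m*10^k)/10^k * 10^e;   % chopping: digits beyond k are discarded
end
end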

Example1: Subtractive cancellation (subtracting two nearly equal floating-point numbers). Compute x − sin(x) for x = 1/15 using a hypothetical decimal computer which applies rounding and has a 10-digit mantissa. Note that x is in radians.

x = 0.6666666667 × 10^(−1)
sin(x) = 0.6661729492 × 10^(−1)
x − sin(x) = 0.0004937175 × 10^(−1)
x − sin(x) = 0.4937175000 × 10^(−4) (after normalisation)

Note that in the final result the last three zeros are not significant digits but are added just to fill the spaces at the end of the mantissa. See that in the final result the number of significant digits has decreased. In Matlab, the result is 4.937174327367122e-005 (16 significant decimal digits).

Example2: Subtractive cancellation (subtracting two nearly equal floating-point numbers). Consider a hypothetical decimal computer which applies chopping, and has a 4-digit mantissa and a 1-digit exponent. Using this computer, subtract 0.764111 × 10^1 from 0.764299 × 10^1.

  0.764299 × 10^1
− 0.764111 × 10^1
__________________
  0.000188 × 10^1 → 0.0001 × 10^1 → 0.1000 × 10^(−2)

The exact result in normalised form is 0.188 × 10^(−2) but the result of the computer is 0.1000 × 10^(−2). See that in the computer's result the last three zeros are insignificant and cause a considerable computational (or round-off) error, as indicated by the true percent relative error given below.

(0.188 − 0.1)/0.188 × 100 = 46.81%

We can minimise the effects of subtractive cancellation by using double (or extended) precision. Recasting the formulation is another useful technique (study the example in the textbook in which the quadratic formula is recast to minimise subtractive cancellation).

Example3: When two floating-point numbers are added (or subtracted), the mantissa of the number with the smaller exponent is modified to make the exponents the same. Consider a hypothetical decimal computer which applies chopping, and has a 4-digit mantissa and a 1-digit exponent. Let's add 0.4381 × 10^(−1) to 0.1557 × 10^1.

  0.004381 × 10^1
+ 0.1557   × 10^1
__________________
  0.160081 × 10^1 → 0.1600 × 10^1

The exact result is 0.160081 × 10^1 but the result of the computer is 0.1600 × 10^1. In the final result (i.e. in 0.1600 × 10^1), the last two zeros are significant.

Example4: Adding a large and a small number. Use the same hypothetical decimal computer given in Example2 and Example3. Add 4000 and 0.0010. The floating-point representations of these two numbers are 0.4000 × 10^4 and 0.1000 × 10^(−2), respectively.

  0.4000       × 10^4
+ 0.0000001000 × 10^4
________________________
  0.4000001000 × 10^4 → 0.4000 × 10^4

When there is a large number of computations, the accumulation of round-off errors can be significant. Suppose you add a group of large numbers (of the order of 1000) to a group of small numbers (of the order of 0.01), i.e. 1000 + 1500 + 950 + 1250 + … + 0.01 + 0.015 + 0.009 + 0.02 + … If you use the same hypothetical decimal computer, in which direction is the computation more accurate: from left to right, or from right to left?
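The sketch below explores this question (illustrative values; single precision is used instead of the 4-digit decimal machine so that the effect is easy to reproduce):

large = single(1000*ones(1,100));      % sums to 100000
small = single(0.0001*ones(1,10000));  % sums to about 1
s1 = single(0);
for v = [large small], s1 = s1 + v; end  % left to right: large numbers first
s2 = single(0);
for v = [small large], s2 = s2 + v; end  % right to left: small numbers first
fprintf('large first: %.4f  small first: %.4f  exact: 100001\n', s1, s2)
% Large first: each 0.0001 falls below half the spacing of single-precision
% numbers near 1e5 and is rounded away, so the small group is lost entirely.
% Summing the small numbers first preserves their contribution.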


Example5: In multiplication (and division), the mantissas are multiplied (divided) and the exponents are added (subtracted). Considering that the multiplication of two n-digit mantissas produces a 2n-digit result, computers typically hold intermediate results in a double-length register. As an example, multiply 1.363 × 10^2 by 64.23 × 10^(−3) using the decimal computer given in Example2 and Example3. First normalise and then multiply the given numbers (which already have 4 significant digits) as follows: 0.1363 × 10^3 × 0.6423 × 10^(−1) = 0.08754549 × 10^2. Normalising the result yields 0.8754549 × 10^1. Finally, apply chopping to get the final result 0.8754 × 10^1.

Example6: Using a decimal computer with a 3-digit mantissa, compute y = x^3 − 6.1x^2 + 3.2x + 1.5 for x = 4.71.

Solution: Let's apply 3-digit rounding arithmetic. Note that you must apply the operator precedence rules given in the next section, entitled "Operator Precedence in Matlab".

x^2 : fl( fl(x) × fl(x) ) = fl( (0.471·10^1) × (0.471·10^1) ) = fl(0.221841·10^2) = 0.222·10^2
x^3 : fl( fl(x^2) × fl(x) ) = fl( (0.222·10^2) × (0.471·10^1) ) = fl(0.104562·10^3) = 0.105·10^3
6.1x^2 : fl( fl(6.1) × fl(x^2) ) = fl( (0.610·10^1) × (0.222·10^2) ) = fl(0.13542·10^3) = 0.135·10^3
3.2x : fl( fl(3.2) × fl(x) ) = fl( (0.320·10^1) × (0.471·10^1) ) = fl(0.15072·10^2) = 0.151·10^2

Now do the additions and subtractions from left to right.

x^3 − 6.1x^2 : fl( fl(x^3) − fl(6.1x^2) ) = fl( (0.105·10^3) − (0.135·10^3) ) = fl(−0.030·10^3) = −0.300·10^2
(x^3 − 6.1x^2) + 3.2x : fl( fl(x^3 − 6.1x^2) + fl(3.2x) ) = fl( (−0.300·10^2) + (0.151·10^2) ) = −0.149·10^2
(x^3 − 6.1x^2) + 3.2x + 1.5 : fl( −0.149·10^2 + 0.15·10^1 ) = fl( −0.149·10^2 + 0.015·10^2 ) = −0.134·10^2

Finally, the result is −13.4. The exact value of y is −14.263899. If 3-digit chopping arithmetic were applied instead, the result would be −13.5. A summary of some of the results is tabulated below.

                  x     x^2      x^3         6.1x^2     3.2x
Exact values      4.71  22.1841  104.487111  135.32301  15.072
3-digit rounding  4.71  22.2     105         135        15.1
3-digit chopping  4.71  22.1     104         134        15.0
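If the fl10 helper sketched earlier is on the Matlab path, this 3-digit rounding arithmetic can be replayed step by step (an illustrative usage, assuming that helper):

x  = fl10(4.71, 3, 'round');
x2 = fl10(x*x, 3, 'round');                      % 22.2
x3 = fl10(x2*x, 3, 'round');                     % 105
t1 = fl10(fl10(6.1,3,'round')*x2, 3, 'round');   % 135
t2 = fl10(fl10(3.2,3,'round')*x, 3, 'round');    % 15.1
y  = fl10(fl10(fl10(x3-t1,3,'round')+t2,3,'round')+1.5, 3, 'round')  % -13.4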

Finally, the result is −13.4 The exact value of y is: −14.263899 If 3-digit chopping arithmetic was applied instead, the result would be −13.5. The summary of some of the results are tabulated as given below. 푥 푥 푥 6.1푥 3.2푥 Exact values 4.71 22.1841 104.487111 135.32301 15.072 3-digit rounding 4.71 22.2 105 135 15.1 3-digit chopping 4.71 22.1 104 134 15.0 Example7: The base-10 number 0.0001 cannot be expressed exactly in base-2 therefore we will have round-off error when we make computations with the number 0.0001. When we store 0.0001 or when we make a few computations with the number 0.0001 in the computer, this round-off error is small (i.e. insignificant) and not discernible. However, when we make a large number of computations with 0.0001, the individual round-off errors accumulate and the resulting total round-off error can be significant and discernible. Note that conversion between binary and decimal systems takes place in the computer and computers should match a binary number with the closest decimal number. See the below computations which are performed in Matlab. >> 0.0001

ans =

1.000000000000000e-04

>> 0.0001+0.0001+0.0001+0.0001

ans =

4.000000000000000e-04

--------------------------------------------------------

>> summ=0;

>> for i=1:20000

summ=summ+0.0001;

end

>> summ


summ =

1.999999999999796 %Notice that the true sum is 2.

--------------------------------------------------------

>> sum(0.0001*ones(1,20000)) %"sum" and "ones" are built-in Matlab functions

ans =

1.999999999999796 %Notice that the true sum is 2.

--------------------------------------------------------

We should note that Matlab has some features that help to minimise such round-off errors. See the statements below, written in Matlab.

>> 0.0001*20000

ans =

2

--------------------------------------------------------

>> t=0:0.0001:2;

>> t(end) %See that the last element of vector t is 2, not 1.999999999999796

ans =

2

Operator Precedence in Matlab

An operator is a symbol representing an action (i.e. an operation) on one or two items. Operands are the items which operators operate on. In the expression h+4, the operator is + and the operands are h and 4. Operators can be either unary or binary, i.e. they have one or two operands, respectively. A unary operator acts on only one operand; for example, - is a unary operator when used to negate a value, as in b=-2. A binary operator operates on two operands. Operator precedence is the order in which Matlab evaluates an expression. Within each precedence level shown below, operators have equal precedence and are evaluated from left to right. The operator precedence rules for the operators in Matlab are listed below, from the highest precedence level to the lowest:

1) Parentheses
2) Transpose (.'), power (.^), complex conjugate transpose ('), matrix power (^)
3) Unary plus (+), unary minus (-), logical negation (~) (Note: in Matlab, when we type -2^2 we get -4, and when we type (-2)^2 we get 4, since power binds more tightly than the unary minus.)
4) Multiplication (.*), right division (./), left division (.\), matrix multiplication (*), matrix right division (/), matrix left division (\)
5) Addition (+), subtraction (-)
6) Colon operator (:)
7) Less than (<), less than or equal to (<=), greater than (>), greater than or equal to (>=), equal to (==), not equal to (~=)
8) Element-wise AND (&)
9) Element-wise OR (|)
10) Short-circuit AND (&&)
11) Short-circuit OR (||)
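A few quick checks of these precedence rules in the Command Window (an illustrative sketch):

>> -2^2     % power binds before unary minus: -(2^2) gives -4
>> (-2)^2   % parentheses are evaluated first: gives 4
>> 2+3*4    % multiplication before addition: gives 14
>> 1:3+1    % addition binds before the colon, so this is 1:4, i.e. [1 2 3 4]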

Total Numerical Error

Total numerical error is the sum of round-off errors and truncation errors. Round-off errors increase when there is a considerably large number of computations. Round-off errors may also be of concern due to problems associated with computer arithmetic (such as subtractive cancellation). In particular, round-off errors can increase if the step size is decreased for a given numerical procedure, since more operations are then required.

On the other hand, truncation errors decrease when the step size is decreased. Hence, the step size should be chosen optimally. However, computers today carry enough significant digits (extended precision), and there are also smart algorithms (such as recasting the formulation, adjusting the order of computations, and efficient, accurate numerical methods) which are capable of diminishing the effects of total numerical error. We may still need some trial and error, experience and intuition in choosing appropriate step sizes.

→ Note: Truncation and round-off errors can also be magnified. Consider a number with some small error: if this number is multiplied by a large number, the error is magnified. The same situation occurs if a number containing a small error is divided by a small number.

Example8: Consider the integral I = ∫_a^b f(x) dx with f(x) = sin(x), a = 0 and b = π. The true value of this integral is I = −cos(x)|_0^π = 2. We can approximate the integral I by using the Trapezoidal rule, in which data points are selected on the function f and each pair of these data points is connected by a straight line. The total area under these straight lines is an approximation for the integral I, as shown in the figure below, where 4 straight-line segments are used. The x-coordinates of the data points are equally spaced and the spacing is the step size h. As the step size is decreased (or the number of segments is increased), the truncation error will decrease, since the straight lines will more closely follow the shape of the function f. In order to clearly see the effect of reducing the step size on round-off and total numerical error, we can write a Matlab program. The Matlab function trapz applies the Trapezoidal rule to estimate the value of an integral numerically by approximating the function as a set of straight lines. Straight lines are simple functions whose integrals can be easily evaluated. A script m-file in Matlab can be written to investigate the total numerical error in integration, as given below.

clc, clear all, close all hidden
f=@(x) sin(x); %Define the function f
a=0; b=pi; %The lower and upper limits of the integral
% n: Number of segments
n0=2; %The initial number of segments
% We will progressively multiply n by 5 for m times.
m=input('How many times would you like to multiply the number of segments by 5?: ');
It=2; %The true value of the given integral
fprintf(' step size true value estimation |true error| \n')
for i=1:(m+1)
    n=n0*5^(i-1); % n is the number of segments to be applied
    h=(b-a)/n; % h is the corresponding step size for this n
    x=a:h:b; % x-coordinates of the data points to be used
    y=f(x); % y-coordinates of the data points to be used
    Ia=trapz(x,y); % the approximate integral given by the Trapezoidal rule
    Et=abs(It-Ia); % the absolute value of the resulting true error for this h
    fprintf(' %3.12f %5.10f %5.15f %5.15f \n',h,It,Ia,Et)


    %Plot the |true error| versus the step size.
    %Use logarithmic scale for both the x and y axes to get a better view.
    loglog(h,Et,'o'), xlabel('h (step size)'), ylabel('|True Error| Et'), hold on
end
hold off

The output of the program is:

How many times would you like to multiply the number of segments by 5?: 14

step size true value estimation |true error|

1.570796326795 2.0000000000 1.570796326794897 0.429203673205103

0.314159265359 2.0000000000 1.983523537509455 0.016476462490545

0.062831853072 2.0000000000 1.999341983076262 0.000658016923738

0.012566370614 2.0000000000 1.999973680985661 0.000026319014339

0.002513274123 2.0000000000 1.999998947242086 0.000001052757914

0.000502654825 2.0000000000 1.999999957889688 0.000000042110312

0.000100530965 2.0000000000 1.999999998315588 0.000000001684412

0.000020106193 2.0000000000 1.999999999932622 0.000000000067378

0.000004021239 2.0000000000 1.999999999997309 0.000000000002691

0.000000804248 2.0000000000 1.999999999999894 0.000000000000106

0.000000160850 2.0000000000 2.000000000000011 0.000000000000011

0.000000032170 2.0000000000 2.000000000000038 0.000000000000038

Out of memory. Type HELP MEMORY for your options.

Error in trapz (line 68)

In the tabulated results, the |true error| is composed of both truncation and round-off errors. The |true error| decreases as the step size is decreased, down to h ≅ 1.6 × 10^(−7), since the reduction in truncation error is more dominant than the increase in round-off error. For smaller step sizes (or higher numbers of segments), the round-off error becomes dominant, since the number of operations required by the computer increases drastically; as a result, the |true error| starts to increase. As shown in the tabulated results, the number of segments could only be increased by a factor of 5^11 = 48828125, since for a larger number of segments the memory required by Matlab exceeds the available memory of the computer. Therefore, it can be concluded that there is no need to use extremely strict (i.e. small) error tolerances in numerical approximations.

Example9: Using a Taylor series expansion, we can obtain a formula for the first derivative of a function, as given below. Note that x_{i+1} = x_i + h and x_{i−1} = x_i − h.

f'(x_i) = [ f(x_{i+1}) − f(x_{i−1}) ] / (2h) − [ f^(3)(ξ)/6 ] h^2     (Eqn. 1)


Here, f'(x_i) can be calculated approximately as f'(x_i) ≅ [ f(x_{i+1}) − f(x_{i−1}) ] / (2h). The truncation error for this approximation then becomes −[ f^(3)(ξ)/6 ] h^2. However, when we use the formula [ f(x_{i+1}) − f(x_{i−1}) ] / (2h) to estimate f'(x_i), we will normally have round-off errors as well. First of all, f(x_{i+1}) and f(x_{i−1}) will be stored in the computer as fl(f(x_{i+1})) and fl(f(x_{i−1})), respectively. Therefore, we will have

f(x_{i+1}) = fl(f(x_{i+1})) + e_{i+1},   f(x_{i−1}) = fl(f(x_{i−1})) + e_{i−1}

where e_{i+1} and e_{i−1} are the individual round-off errors. The step size h is chosen to be small and we typically have 0 < h ≤ 1; hence f(x_{i+1}) and f(x_{i−1}) are of the same order of magnitude, which means the additional round-off error due to the computer subtraction (i.e. fl(f(x_{i+1})) − fl(f(x_{i−1}))) is much smaller than e_{i+1} and e_{i−1}, and can be neglected. Besides, we will have additional round-off error when we divide fl(f(x_{i+1})) − fl(f(x_{i−1})) by 2h. As a result of this division, we will have a quotient which must be rounded, yielding some round-off error; for h = 1 this round-off error is δ, and |δ| ≤ 2^(−1) × 2^(−52) when double precision is used. Let δ* be the upper bound for |δ|; then δ* = 2^(−1) × 2^(−52) = 2^(−53). For other values of h, the upper bound for the absolute value of the round-off error due to the computer division will be equal to δ*/h, as the amount of round-off error is directly proportional to the magnitude of the quotient. Finally, we obtain the equation given below.

f'(x_i) = fl( [ f(x_{i+1}) − f(x_{i−1}) ] / (2h) ) + (e_{i+1} − e_{i−1}) / (2h) + δ/h − [ f^(3)(ξ)/6 ] h^2     (Eqn. 2)

In Eqn. 2, f'(x_i) is the true value. On the right-hand side of Eqn. 2, the first term is the approximation of f'(x_i) by the computer, the next two terms are the round-off error, and the last term is the truncation error. In this simple theoretical analysis, we study how the round-off and truncation errors vary with respect to the step size h; therefore we disregard the round-off error incurred in storing h itself in the computer. Eqn. 2 indicates that the round-off error is inversely proportional to h, whereas the truncation error is directly proportional to h^2. Notice that x_{i+1} and x_{i−1} change as we change the step size. We need to make further simplifications in our analysis of Eqn. 2. Assume that the absolute values of e_{i+1} and e_{i−1} have an upper bound ε, so that (e_{i+1} − e_{i−1}) has a maximum absolute value of 2ε. Also assume that f^(3)(ξ) has a maximum absolute value of M. Then the upper bound for the absolute value of the total error can be obtained from the inequality given below.

|Total error| = | f'(x_i) − fl( [ f(x_{i+1}) − f(x_{i−1}) ] / (2h) ) | ≤ ε/h + δ*/h + h^2·M/6     (Eqn. 3)

The upper bound for the absolute value of the total error can be minimised with respect to the step size to get an estimate of the optimal step size h_opt, as given below.

(d/dh) [ (ε + δ*)/h + h^2·M/6 ] = 0  ⟹  h_opt = [ 3(ε + δ*)/M ]^(1/3)     (Eqn. 4)

Example10: Let's test the theoretical results found in Example9 against a numerical analysis performed in Matlab. Consider the function f(x) = −0.1x^4 − 0.15x^3 − 0.5x^2 − 0.25x + 1.2. We would like to estimate f'(0.5) using the formula f'(x_i) ≅ [ f(x_{i+1}) − f(x_{i−1}) ] / (2h), which is obtained from Eqn. 1. We will start with h = 1 and then progressively divide the step size by 10 to investigate how the total error in our estimate changes as we decrease the step size h. Since f'(x) = −0.4x^3 − 0.45x^2 − x − 0.25, the true value is f'(0.5) = −0.9125. A function m-file in Matlab can be written as given below.

function totalerror_demo(fx,fdx,x0,h0,n)
% fx corresponds to the function f(x)
% fdx corresponds to the true derivative f'(x)
% fx and fdx will be defined using function handles with the operator @


% We would like to estimate the value of f'(x0) using different step sizes.
% h0 is the initial value of the step size.
% We will progressively divide h by 10 for n times.
% In this program, the true value of f' at x=x0 is given by fdx(x0)
fprintf(' step size true value estimation |true error| \n')
for i=1:(n+1)
    h(i)=h0/10^(i-1); %h is the array of step sizes to be used
    fde(i)=( fx(x0+h(i))-fx(x0-h(i)) )/(2*h(i)); %fde(i) is the estimate of f'(x0) using the step size h(i)
    Et(i)=abs(fdx(x0)-fde(i)); %Et(i) is the absolute value of the resulting true error when we use h(i)
    fprintf(' %3.10f %5.10f %5.15f %5.15f \n',h(i),fdx(x0),fde(i),Et(i))
end
%Plot the |true error| versus the step size.
%Use logarithmic scale for both the x and y axes to get a better view.
loglog(h,Et,'o'), xlabel('h (step size)'), ylabel('|True Error| Et')
end

Now, run this m-file by typing the following statements in the Command Window of Matlab. >> f=@(x) -0.1*x.^4-0.15*x.^3-0.5*x.^2-0.25*x+1.2;

>> fd=@(x) -0.4*x.^3-0.45*x.^2-x-0.25;

>> totalerror_demo(f,fd,0.5,1,10)

The results are given below.

 step size true value estimation |true error|

1.0000000000 -0.9125000000 -1.262500000000000 0.350000000000000

0.1000000000 -0.9125000000 -0.915999999999999 0.003499999999999

0.0100000000 -0.9125000000 -0.912534999999998 0.000034999999998

0.0010000000 -0.9125000000 -0.912500350000012 0.000000350000012

0.0001000000 -0.9125000000 -0.912500003499850 0.000000003499850

0.0000100000 -0.9125000000 -0.912500000033178 0.000000000033178

0.0000010000 -0.9125000000 -0.912500000005423 0.000000000005423

0.0000001000 -0.9125000000 -0.912499999450311 0.000000000549689

0.0000000100 -0.9125000000 -0.912500003336092 0.000000003336092

0.0000000010 -0.9125000000 -0.912500019989437 0.000000019989437

0.0000000001 -0.9125000000 -0.912500075500589 0.000000075500589


In the tabulated results, the |true error| is equal to the |Total error| given in Eqn. 3 and is composed of both truncation and round-off errors. Looking at the error terms in Eqn. 3 and the tabulated results, the |Total error| is dominated by the truncation error h^2·f^(3)(ξ)/6 for 10^(−5) ≤ h ≤ 1; thus the |Total error| is reduced by a factor of 100 when we divide the step size by 10. Starting from h = 10^(−6), round-off error begins to influence the |Total error|. At h = 10^(−6), the |Total error| becomes minimum. For h < 10^(−6), round-off error becomes dominant, since the |Total error| increases as the step size is decreased. We should now evaluate the optimal step size h_opt found in our theoretical study, Eqn. 4. Remember that h_opt is the value of the step size which minimises the upper bound for the absolute value of the total error. In Eqn. 4, we can approximate M as M = |f^(3)(0.5)| = 2.1. The value of ε is approximately equal to eps(0.5)/2 = 0.555 × 10^(−16) in Matlab. δ* can be estimated as δ* = 2^(−1) × 2^(−52), since the absolute values of the computed derivatives are around 1; then δ* = 2^(−53) ≅ 1.11 × 10^(−16). Consequently, h_opt = 6.196 × 10^(−6), which is of the same order as the value h = 10^(−6) obtained from our numerical analysis performed in Matlab.