Problems with Floating-Point Representations
Douglas Wilhelm Harder
Department of Electrical and Computer Engineering
University of Waterloo
Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved.
ECE 204 Numerical Methods for Computer Engineers
• This topic will cover a number of the problems with using a floating-point representation, including:
  – underflow and overflow
  – subtractive cancellation
  – adding large and small numbers
  – non-associativity: (a + b) + c ≠ a + (b + c)
Underflow and Overflow
• In our six-decimal-digit floating-point representation, the largest number we can represent is 9.999 × 10^50
• The largest double is 1.8 × 10^308:
>> format long; realmax
realmax = 1.79769313486232e+308
>> format hex; realmax
realmax = 7fefffffffffffff
or, more precisely, 1.1111111111111111111111111111111111111111111111111111₂ × 2^1023
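Since Python's floats are the same IEEE 754 doubles, these limits can be checked outside Matlab as well; a small sketch, not part of the original slides:

```python
import sys
import struct

# The largest finite double, analogous to Matlab's realmax.
print(sys.float_info.max)            # 1.7976931348623157e+308

# Its raw bit pattern, analogous to 'format hex' in Matlab.
bits = struct.unpack('<Q', struct.pack('<d', sys.float_info.max))[0]
print(format(bits, '016x'))          # 7fefffffffffffff

# Doubling it overflows to the floating-point infinity.
print(sys.float_info.max * 2)        # inf
```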
• Any number larger than these values cannot be represented using these formats
• To solve this problem, we can introduce a floating-point infinity:
>> format long; 2e308
ans = Inf
>> format hex; 2e308
ans = 7ff0000000000000
• The properties of infinity include:
  – any real plus infinity is infinity
  – one over infinity is 0
  – any positive number times infinity is infinity
  – any negative number times infinity is –infinity
• For example:
>> Inf + 1e100
ans = Inf
>> 325*Inf
ans = Inf
>> 1/Inf
ans = 0
>> -2*Inf
ans = -Inf
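These identities can be checked with Python's math.inf, which follows the same IEEE 754 rules; a small sketch:

```python
import math

inf = math.inf

print(inf + 1e100)   # inf: any finite real plus infinity is infinity
print(1 / inf)       # 0.0: one over infinity is zero
print(325 * inf)     # inf: a positive number times infinity is infinity
print(-2 * inf)      # -inf: a negative number times infinity is -infinity
```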
• The introduction of a floating-point infinity allows computations to continue and removes the necessity of signaling overflows through exceptions
• One example where an overflow causes no harm is when the reciprocal of the overflowed result is taken immediately:
>> 5 + 1/2e400
ans = 5
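Python shows the same behaviour: an overflowing literal quietly becomes inf, and taking the reciprocal immediately yields 0, so the surrounding computation is unharmed. A sketch:

```python
# A literal too large for a double overflows to inf rather than
# raising an exception, just as in Matlab.
x = 2e400          # overflows to inf
print(x)           # inf
print(5 + 1/x)     # 5.0: the reciprocal of inf is 0
```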
• In our six-decimal-digit floating-point representation, the smallest number we can represent is 1.000 × 10^-49
• The smallest positive double (using the normal representation) is 2.2 × 10^-308:
>> format long; realmin
realmin = 2.22507385850720e-308
>> format hex; realmin
realmin = 0010000000000000
or, more precisely, 2^-1022
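The corresponding check in Python, whose floats are the same doubles:

```python
import sys
import struct

# The smallest positive normal double, analogous to Matlab's realmin.
print(sys.float_info.min)            # 2.2250738585072014e-308

# Its raw bit pattern, analogous to 'format hex' in Matlab.
bits = struct.unpack('<Q', struct.pack('<d', sys.float_info.min))[0]
print(format(bits, '016x'))          # 0010000000000000

# It is exactly 2^-1022.
print(2.0 ** -1022 == sys.float_info.min)   # True
```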
• Storing real numbers on a computer:
  – we must use a fixed amount of memory,
  – we should be able to represent a wide range of numbers, both large and small,
  – we should be able to represent numbers with a small relative error, and
  – we should be able to easily test if one number is greater than, equal to, or less than another
• Any number smaller than these values is represented by 0
• This is represented by a double with all 0s, with the possible exception of the sign bit:
>> format hex; 0
ans = 0000000000000000
>> -0
ans = 8000000000000000
>> format long; 1/0
ans = Inf
>> 1/-0
ans = -Inf
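The signed-zero bit patterns can be inspected from Python too; note that, unlike Matlab, Python raises ZeroDivisionError on 1/0, so the sign of zero is demonstrated through the bit pattern and copysign instead (hexbits is a helper written just for this sketch):

```python
import math
import struct

def hexbits(x):
    # Raw 64-bit pattern of a double, like Matlab's 'format hex'.
    return format(struct.unpack('<Q', struct.pack('<d', x))[0], '016x')

print(hexbits(0.0))    # 0000000000000000
print(hexbits(-0.0))   # 8000000000000000: only the sign bit differs

# The sign of zero is observable even though 0.0 == -0.0.
print(math.copysign(1.0, -0.0))   # -1.0
```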
• You may have noticed that we did not use both the largest and smallest exponents:
>> format hex; realmax
realmax = 7fefffffffffffff
>> realmin
realmin = 0010000000000000
• The largest and smallest exponents should have been 7ff and 000, respectively
• These “special” exponents are used to represent special numbers, such as:
  – infinity: 7ff000···, fff000···
  – not-a-number: 7ff800···
  – 0: 000000···, 800000···
  – denormalized numbers: numbers existing between 0 and realmin, but at reduced precision
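The denormalized range can be probed in Python as well; 5e-324 is the smallest positive subnormal double:

```python
import sys
import struct

tiny = 5e-324                        # smallest positive subnormal double
print(tiny > 0)                      # True: representable, yet below realmin
print(tiny < sys.float_info.min)     # True: it lies between 0 and realmin
print(tiny / 2)                      # 0.0: anything smaller underflows to zero

# Its bit pattern uses the all-zero "special" exponent.
bits = struct.unpack('<Q', struct.pack('<d', tiny))[0]
print(format(bits, '016x'))          # 0000000000000001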
• Thus, we can classify numbers as those which:
  – are represented by 0,
  – are not represented with full precision,
  – are represented using 53 bits of precision, and
  – are represented by infinity
Subtractive Cancellation
• The next problem we will look at deals with subtracting similar numbers
• Suppose we take the difference between π and the 3-digit approximation 3.14 using our six-digit floating-point representation:
  π → 0493142        3.14 → 0493140
• Performing the calculation:
  3.142 – 3.140 = 0.002 = 2.000 × 10^-3
  which has the representation 0462000
• How accurate is this difference?
• Recall that 3.14 is represented precisely by our floating-point representation, but our representation of π has a relative error of 0.00012
• By calculating the difference of these almost-equal numbers, we lose a significant amount of precision
• The actual value of the difference is
  π – 3.14 = 0.001592654···
  and therefore the relative error of our approximation 0.002 of this difference is
  (0.002 – 0.001592654)/0.001592654 = 0.2558
• Thus, the relative error of the quantity we were trying to calculate is significant: 25.58%
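This calculation can be replayed in Python by rounding every value to four significant decimal digits; round4 is a hypothetical helper written just for this sketch:

```python
import math

def round4(x):
    # Keep only four significant decimal digits, like the slides' format.
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x)))
    return round(x, 3 - e)

pi4  = round4(math.pi)        # 3.142, relative error about 1.3e-4
diff = round4(pi4 - 3.14)     # 3.142 - 3.140 = 0.002
exact = math.pi - 3.14        # 0.001592654...
rel_err = abs(diff - exact) / exact

print(diff)      # 0.002
print(rel_err)   # about 0.2558, i.e. 25.58%
```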
• Subtractive cancellation is the phenomenon where the subtraction of similar numbers results in a significant reduction in precision
• As another example, recall the definition of the derivative:
  f^(1)(x) = lim_{h→0} (f(x + h) – f(x)) / h
• Assuming that this limit converges, then using a smaller and smaller value of h should result in a very good approximation to f^(1)(x)
• Let’s try this out with f(x) = sin(x) and approximate f^(1)(1)
• From calculus, we know that the actual derivative is cos(1) = 0.5403023058681397···
• Let us use Matlab to approximate this derivative using h = 0.1, 0.01, 0.001, ...
>> for i=1:8
h = 10^-i; (sin(1 + h) - sin(1))/h
end
ans = 0.497363752535389
ans = 0.536085981011869
ans = 0.539881480360327
ans = 0.540260231418621
ans = 0.540298098505865
ans = 0.540301885010308
ans = 0.540302264040449
ans = 0.540302291796024
>> for i=8:16
h = 10^-i; (sin(1 + h) - sin(1))/h
end
ans = 0.540302291796024
ans = 0.540302358409406
ans = 0.540302247387103
ans = 0.540301137164079
ans = 0.540345546085064
ans = 0.539568389967826
ans = 0.532907051820075
ans = 0.555111512312578
ans = 0
• What happened here?
• With h = 10^-8, we had an approximation with a relative error of 2.6 × 10^-8, or about 7 decimal digits of precision
• With smaller and smaller values of h, however, the error increases until we have a completely useless approximation when h = 10^-16
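The same experiment can be reproduced in Python (the same IEEE 754 doubles, so the same behaviour is expected); a sketch:

```python
import math

# Forward-difference approximation of d/dx sin(x) at x = 1,
# for h = 1e-1 down to 1e-16.
exact = math.cos(1.0)
rel_err = {}
for i in range(1, 17):
    h = 10.0 ** -i
    approx = (math.sin(1.0 + h) - math.sin(1.0)) / h
    rel_err[i] = abs(approx - exact) / exact

best = min(rel_err, key=rel_err.get)
print(best)             # the sweet spot is near h = 1e-8
print(rel_err[best])
print(rel_err[16])      # 1.0: sin(1 + 1e-16) == sin(1), so the quotient is 0
```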
• Looking at sin(1 + h) and sin(1) when h = 10^-12:
>> h = 1e-12
h = 1.00000000000000e-12
>> sin(1 + h)
ans = 0.841470984808437
>> sin(1)
ans = 0.841470984807897
• Consequently, we are subtracting two numbers which are almost equal
• The next slide shows the bits using h = 2^-n for n = 1, 2, ..., 53
• Note that double-precision floating-point numbers have 53 bits of precision
• In the original slides, red digits marked the bits corrupted by the subtractive cancellation; note how the run of meaningless trailing zero bits grows as h shrinks
ans = 0011111111010011111110001001100000110000100011011000001001110100
ans = 0011111111011011100001100000001101111000010000011010010011110000
ans = 0011111111011111001000001011101110110001001110100000111000000000
ans = 0011111111100000011011111110110111010001101110101110100101110000
ans = 0011111111100000110111011011110010010001110111011111011010000000
ans = 0011111111100001000101000001111110010011011101000110100000000000
ans = 0011111111100001001011110010111100111101010111001011000110000000
ans = 0011111111100001001111001010111010000100110110111000100000000000
ans = 0011111111100001010000110110110000000010010010001101100000000000
ans = 0011111111100001010001101100101000110111000011001000110000000000
ans = 0011111111100001010010000111100100101110111001011110000000000000
ans = 0011111111100001010010010101000010100010001011101111000000000000
ans = 0011111111100001010010011011110001011001101010100110000000000000
ans = 0011111111100001010010011111001000110100110111011100000000000000
ans = 0011111111100001010010100000110100100010010101010000000000000000
ans = 0011111111100001010010100001101010011001000010000000000000000000
ans = 0011111111100001010010100010000101010100010111100000000000000000
ans = 0011111111100001010010100010010010110010000010000000000000000000
ans = 0011111111100001010010100010011001100000111000000000000000000000
ans = 0011111111100001010010100010011100111000010000000000000000000000
ans = 0011111111100001010010100010011110100100000000000000000000000000
ans = 0011111111100001010010100010011111011001110000000000000000000000
ans = 0011111111100001010010100010011111110100100000000000000000000000
ans = 0011111111100001010010100010100000000010000000000000000000000000
ans = 0011111111100001010010100010100000001000000000000000000000000000
ans = 0011111111100001010010100010100000001100000000000000000000000000
ans = 0011111111100001010010100010100000010000000000000000000000000000
ans = 0011111111100001010010100010100000010000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010000000000000000000000000000000000000
ans = 0011111111100001010010100010000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010000000000000000000000000000000000000000000
ans = 0011111111100001010010000000000000000000000000000000000000000000
ans = 0011111111100001010000000000000000000000000000000000000000000000
ans = 0011111111100001010000000000000000000000000000000000000000000000
ans = 0011111111100001010000000000000000000000000000000000000000000000
ans = 0011111111100001000000000000000000000000000000000000000000000000
ans = 0011111111100001000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0000000000000000000000000000000000000000000000000000000000000000
• The output above approximates the derivative of sin(x) at x = 1; in the original slides:
  – green digits showed accuracy, while
  – red digits showed loss of precision
>> for i=1:53
     h = 2^-i;
     (sin(1 + h) - sin(1))/h
   end
• Later in this course, we will find a formula which will approximate the derivative of sin(x) at x = 1 using h = 0.001 by
0.540302305868125
which is significantly closer tocos(1) = 0.540302305868140 than any approximation we saw before
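The slides have not yet said what that formula is; as an illustration of the idea, even the simple centered difference (an assumption here, not necessarily the course's formula) recovers several of the digits the forward difference loses:

```python
import math

h = 0.001
exact = math.cos(1.0)

# Forward difference: O(h) truncation error.
forward = (math.sin(1 + h) - math.sin(1)) / h
# Centered difference: O(h^2) truncation error.
central = (math.sin(1 + h) - math.sin(1 - h)) / (2 * h)

print(abs(forward - exact))   # about 4e-4
print(abs(central - exact))   # about 9e-8
```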
• Thus, we cannot simply use the formulae covered in calculus to compute values numerically
• We will now see how an algebraic formula you learned in high school can also fail:
  – the quadratic formula
    x = (–b ± √(b^2 – 4ac)) / (2a)
• Rather than using doubles, we will use our six-digit floating-point numbers to show how the quadratic formula can fail
• Suppose we wish to find the smaller root of the quadratic equation
0.05231x^2 + 7.539x + 0.1094 = 0
• This equation has roots at
x = –144.1070702, x = –0.01451266977
• Using four decimal-digits of precision for each calculation, we find that our approximation to the smaller of the two roots is x = –0.009560
• The relative error of this approximation is 0.3411, or 34%
• Approximating the larger of the two roots, we get x = –144.2
• The relative error of this approximation is only 0.0006449, or 0.0645%
• Why does one formula work so well while the other fails so miserably?
• Stepping through the calculation:
  b = 7.539
  b^2 = 56.84
  4ac = 0.02289
  b^2 – 4ac = 56.82
  √(b^2 – 4ac) = 7.538
  –b + √(b^2 – 4ac) = –0.001000
• The actual value of –b + √(b^2 – 4ac) is –0.0015183155···
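A standard remedy, not given in the slides, is to rationalize the numerator so that no subtraction of nearly equal numbers takes place; a sketch in double precision:

```python
import math

a, b, c = 0.05231, 7.539, 0.1094
d = math.sqrt(b*b - 4*a*c)

# Naive formula: -b + d subtracts two nearly equal numbers.
small_naive = (-b + d) / (2*a)

# Conjugate form: multiply top and bottom by (-b - d);
# the subtraction disappears, so no cancellation occurs.
small_stable = (2*c) / (-b - d)

print(small_naive)
print(small_stable)   # about -0.01451266977
```

In doubles the naive form only loses a few digits here, but with four-digit arithmetic the conjugate form is what rescues the small root.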
Non-Associativity
• Normally, the operations of addition and multiplication are associative, that is:
(a + b) + c = a + (b + c)
(ab)c = a(bc)
• Unfortunately, floating-point numbers are not associative
• If we add a large number to a small number, the large number dominates:
5592. + 0.5923 = 5593.
• Consider the example
  0.005312 + 54.73 – 54.39
• If we calculate the first sum first:
  (0.005312 + 54.73) – 54.39 = 54.74 – 54.39 = 0.35
• If we calculate the second sum first:
  0.005312 + (54.73 – 54.39) = 0.005312 + 0.34 = 0.3453
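The same effect survives in full double precision; a classic Python sketch:

```python
# Floating-point addition is not associative even with 53 bits of precision.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: (a + b) + c != a + (b + c)
```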
Order of Operations
• Consider calculating the following sum in Matlab:
  Σ_{n=1}^{1000000} 1/n
• The correct answer, to 20 decimal digits of precision, is
  14.392726722865723632
• Adding the numbers in the natural order, from 1 to 106, we get the following result:
14.3927267228648
• Adding the numbers in the reverse order, we get the result
14.3927267228658
• The second result is off only in the last digit shown (and only by 0.76 units in that place)
• To see why this happens, consider a decimal floating-point model which stores only four decimal digits of precision:
52.37 + 0.004291 + 0.0009023
• Adding from left to right, we get: (52.37 + 0.004291) + 0.0009023
= 52.37 + 0.0009023
= 52.37
• Adding the expression from right to left, we get:
52.37 + (0.004291 + 0.0009023)
= 52.37 + 0.005193
= 52.38
• This second value has a smaller relative error when compared to the correct answer (if we keep all precision) of 52.3751933
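The harmonic-sum experiment above can be reproduced in Python (the same IEEE doubles, so the same rounding behaviour is expected); math.fsum serves as the correctly rounded reference:

```python
import math

terms = [1.0 / n for n in range(1, 10**6 + 1)]

forward = 0.0
for t in terms:            # largest terms first
    forward += t

backward = 0.0
for t in reversed(terms):  # smallest terms first
    backward += t

reference = math.fsum(terms)   # correctly rounded sum of the terms
print(abs(forward - reference))
print(abs(backward - reference))  # smaller: adding small-to-large loses less
```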
Usage Notes
• These slides are made publicly available on the web for anyone to use
• If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:– that you inform me that you are using the slides,
– that you acknowledge my work, and
– that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides
Sincerely,
Douglas Wilhelm Harder, MMath