Problems with Floating-Point Representations
Douglas Wilhelm Harder
Department of Electrical and Computer Engineering
University of Waterloo
Copyright © 2007 by Douglas Wilhelm Harder. All rights reserved.
ECE 204 Numerical Methods for Computer Engineers
• This topic will cover a number of the problems with using a floating-point representation, including:
  – underflow and overflow
  – subtractive cancellation
  – adding large and small numbers
  – non-associativity: (a + b) + c ≠ a + (b + c)
Underflow and Overflow
• In our six-decimal-digit floating-point representation, the largest number we can represent is 9.999 × 10^50
• The largest double is 1.8 × 10^308:
>> format long; realmax
realmax = 1.79769313486232e+308
>> format hex; realmax
realmax = 7fefffffffffffff
or, more precisely, 1.1111111111111111111111111111111111111111111111111111₂ × 2^1023
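Since Python's floats are the same IEEE 754 doubles, these limits can be checked outside Matlab as well; a small sketch, not part of the original slides:

```python
import sys
import struct

# The largest finite double, analogous to Matlab's realmax.
print(sys.float_info.max)            # 1.7976931348623157e+308

# Its raw bit pattern, analogous to 'format hex' in Matlab.
bits = struct.unpack('<Q', struct.pack('<d', sys.float_info.max))[0]
print(format(bits, '016x'))          # 7fefffffffffffff

# Doubling it overflows to the floating-point infinity.
print(sys.float_info.max * 2)        # inf
```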
• Any number larger than these values cannot be represented using these formats
• To solve this problem, we can introduce a floating-point infinity:
>> format long; 2e308
ans = Inf
>> format hex; 2e308
ans = 7ff0000000000000
• The properties of infinity include:
  – any real plus infinity is infinity
  – one over infinity is 0
  – any positive number times infinity is infinity
  – any negative number times infinity is –infinity
• For example:
>> Inf + 1e100
ans = Inf
>> 325*Inf
ans = Inf
>> 1/Inf
ans = 0
>> -2*Inf
ans = -Inf
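These identities can be checked with Python's math.inf, which follows the same IEEE 754 rules; a small sketch:

```python
import math

inf = math.inf

print(inf + 1e100)   # inf: any finite real plus infinity is infinity
print(1 / inf)       # 0.0: one over infinity is zero
print(325 * inf)     # inf: a positive number times infinity is infinity
print(-2 * inf)      # -inf: a negative number times infinity is -infinity
```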
• The introduction of a floating-point infinity allows computations to continue and removes the necessity of signaling overflows through exceptions
• One example where an overflow causes no harm is when the reciprocal of the overflowed result is taken immediately:
>> 5 + 1/2e400
ans = 5
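Python shows the same behaviour: an overflowing literal quietly becomes inf, and taking the reciprocal immediately yields 0, so the surrounding computation is unharmed. A sketch:

```python
# A literal too large for a double overflows to inf rather than
# raising an exception, just as in Matlab.
x = 2e400          # overflows to inf
print(x)           # inf
print(5 + 1/x)     # 5.0: the reciprocal of inf is 0
```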
• In our six-decimal-digit floating-point representation, the smallest number we can represent is 1.000 × 10^-49
• The smallest positive double (using the normal representation) is 2.2 × 10^-308:
>> format long; realmin
realmin = 2.22507385850720e-308
>> format hex; realmin
realmin = 0010000000000000
or, more precisely, 2^-1022
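The corresponding check in Python, whose floats are the same doubles:

```python
import sys
import struct

# The smallest positive normal double, analogous to Matlab's realmin.
print(sys.float_info.min)            # 2.2250738585072014e-308

# Its raw bit pattern, analogous to 'format hex' in Matlab.
bits = struct.unpack('<Q', struct.pack('<d', sys.float_info.min))[0]
print(format(bits, '016x'))          # 0010000000000000

# It is exactly 2^-1022.
print(2.0 ** -1022 == sys.float_info.min)   # True
```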
• Storing real numbers on a computer:
  – we must use a fixed amount of memory,
  – we should be able to represent a wide range of numbers, both large and small,
  – we should be able to represent numbers with a small relative error, and
  – we should be able to easily test if one number is greater than, equal to, or less than another
• Any number smaller than these values is represented by 0
• This is represented by a double with all 0s, with the possible exception of the sign bit:
>> format hex; 0
ans = 0000000000000000
>> -0
ans = 8000000000000000
>> format long; 1/0
ans = Inf
>> 1/-0
ans = -Inf
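The signed-zero bit patterns can be inspected from Python too; note that, unlike Matlab, Python raises ZeroDivisionError on 1/0, so the sign of zero is demonstrated through the bit pattern and copysign instead (hexbits is a helper written just for this sketch):

```python
import math
import struct

def hexbits(x):
    # Raw 64-bit pattern of a double, like Matlab's 'format hex'.
    return format(struct.unpack('<Q', struct.pack('<d', x))[0], '016x')

print(hexbits(0.0))    # 0000000000000000
print(hexbits(-0.0))   # 8000000000000000: only the sign bit differs

# The sign of zero is observable even though 0.0 == -0.0.
print(math.copysign(1.0, -0.0))   # -1.0
```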
• You may have noticed that we did not use both the largest and smallest exponents:
>> format hex; realmax
realmax = 7fefffffffffffff
>> realmin
realmin = 0010000000000000
• The largest and smallest exponents should have been 7ff and 000, respectively
• These “special” exponents are used to represent special numbers, such as:
  – infinity: 7ff000···, fff000···
  – not-a-number: 7ff800···
  – 0: 000000···, 800000···
  – denormalized numbers: numbers existing between 0 and realmin, but at reduced precision
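The denormalized range can be probed in Python as well; 5e-324 is the smallest positive subnormal double:

```python
import sys
import struct

tiny = 5e-324                        # smallest positive subnormal double
print(tiny > 0)                      # True: representable, yet below realmin
print(tiny < sys.float_info.min)     # True: it lies between 0 and realmin
print(tiny / 2)                      # 0.0: anything smaller underflows to zero

# Its bit pattern uses the all-zero "special" exponent.
bits = struct.unpack('<Q', struct.pack('<d', tiny))[0]
print(format(bits, '016x'))          # 0000000000000001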
• Thus, we can classify numbers as those which:
  – are represented by 0,
  – are not represented with full precision,
  – are represented using 53 bits of precision, and
  – are represented by infinity
Subtractive Cancellation
• The next problem we will look at deals with subtracting similar numbers
• Suppose we take the difference between π and the 3-digit approximation 3.14 using our six-digit floating-point representation:
  π → 0493142        3.14 → 0493140
• Performing the calculation:
  3.142 – 3.140 = 0.002 = 2.000 × 10^-3
  which has the representation 0462000
• How accurate is this difference?
• Recall that 3.14 is represented precisely by our floating-point representation, but our representation of π has a relative error of 0.00012
• By calculating the difference of these almost-equal numbers, we lose a significant amount of precision
• The actual value of the difference is
  π – 3.14 = 0.001592654···
  and therefore the relative error of our approximation 0.002 of this difference is
  (0.002 – 0.001592654)/0.001592654 = 0.2558
• Thus, the relative error of the quantity we were trying to calculate is significant: 25.58%
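This calculation can be replayed in Python by rounding every value to four significant decimal digits; round4 is a hypothetical helper written just for this sketch:

```python
import math

def round4(x):
    # Keep only four significant decimal digits, like the slides' format.
    if x == 0:
        return 0.0
    e = math.floor(math.log10(abs(x)))
    return round(x, 3 - e)

pi4  = round4(math.pi)        # 3.142, relative error about 1.3e-4
diff = round4(pi4 - 3.14)     # 3.142 - 3.140 = 0.002
exact = math.pi - 3.14        # 0.001592654...
rel_err = abs(diff - exact) / exact

print(diff)      # 0.002
print(rel_err)   # about 0.2558, i.e. 25.58%
```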
• Subtractive cancellation is the phenomenon where the subtraction of similar numbers results in a significant reduction in precision
• As another example, recall the definition of the derivative:
  f^(1)(x) = lim_{h→0} (f(x + h) – f(x)) / h
• Assuming that this limit converges, then using a smaller and smaller value of h should result in a very good approximation to f^(1)(x)
• Let’s try this out with f(x) = sin(x) and approximate f^(1)(1)
• From calculus, we know that the actual derivative is cos(1) = 0.5403023058681397···
• Let us use Matlab to approximate this derivative using h = 0.1, 0.01, 0.001, ...
>> for i=1:8
h = 10^-i; (sin(1 + h) - sin(1))/h
end
ans = 0.497363752535389
ans = 0.536085981011869
ans = 0.539881480360327
ans = 0.540260231418621
ans = 0.540298098505865
ans = 0.540301885010308
ans = 0.540302264040449
ans = 0.540302291796024
>> for i=8:16
h = 10^-i; (sin(1 + h) - sin(1))/h
end
ans = 0.540302291796024
ans = 0.540302358409406
ans = 0.540302247387103
ans = 0.540301137164079
ans = 0.540345546085064
ans = 0.539568389967826
ans = 0.532907051820075
ans = 0.555111512312578
ans = 0
• What happened here?
• With h = 10^-8, we had an approximation with a relative error of 2.6 × 10^-8, or about 7 decimal digits of precision
• With smaller and smaller values of h, however, the error increases until we have a completely useless approximation when h = 10^-16
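The same experiment can be reproduced in Python (the same IEEE 754 doubles, so the same behaviour is expected); a sketch:

```python
import math

# Forward-difference approximation of d/dx sin(x) at x = 1,
# for h = 1e-1 down to 1e-16.
exact = math.cos(1.0)
rel_err = {}
for i in range(1, 17):
    h = 10.0 ** -i
    approx = (math.sin(1.0 + h) - math.sin(1.0)) / h
    rel_err[i] = abs(approx - exact) / exact

best = min(rel_err, key=rel_err.get)
print(best)             # the sweet spot is near h = 1e-8
print(rel_err[best])
print(rel_err[16])      # 1.0: sin(1 + 1e-16) == sin(1), so the quotient is 0
```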
• Looking at sin(1 + h) and sin(1) when h = 10^-12:
>> h = 1e-12
h = 1.00000000000000e-12
>> sin(1 + h)
ans = 0.841470984808437
>> sin(1)
ans = 0.841470984807897
• Consequently, we are subtracting two numbers which are almost equal
• The next slide shows the bits using h = 2^-n for n = 1, 2, ..., 53
• Note that double-precision floating-point numbers have 53 bits of precision
• In the original slides, red digits marked the bits corrupted by the subtractive cancellation; note how the run of meaningless trailing zero bits grows as h shrinks
ans = 0011111111010011111110001001100000110000100011011000001001110100
ans = 0011111111011011100001100000001101111000010000011010010011110000
ans = 0011111111011111001000001011101110110001001110100000111000000000
ans = 0011111111100000011011111110110111010001101110101110100101110000
ans = 0011111111100000110111011011110010010001110111011111011010000000
ans = 0011111111100001000101000001111110010011011101000110100000000000
ans = 0011111111100001001011110010111100111101010111001011000110000000
ans = 0011111111100001001111001010111010000100110110111000100000000000
ans = 0011111111100001010000110110110000000010010010001101100000000000
ans = 0011111111100001010001101100101000110111000011001000110000000000
ans = 0011111111100001010010000111100100101110111001011110000000000000
ans = 0011111111100001010010010101000010100010001011101111000000000000
ans = 0011111111100001010010011011110001011001101010100110000000000000
ans = 0011111111100001010010011111001000110100110111011100000000000000
ans = 0011111111100001010010100000110100100010010101010000000000000000
ans = 0011111111100001010010100001101010011001000010000000000000000000
ans = 0011111111100001010010100010000101010100010111100000000000000000
ans = 0011111111100001010010100010010010110010000010000000000000000000
ans = 0011111111100001010010100010011001100000111000000000000000000000
ans = 0011111111100001010010100010011100111000010000000000000000000000
ans = 0011111111100001010010100010011110100100000000000000000000000000
ans = 0011111111100001010010100010011111011001110000000000000000000000
ans = 0011111111100001010010100010011111110100100000000000000000000000
ans = 0011111111100001010010100010100000000010000000000000000000000000
ans = 0011111111100001010010100010100000001000000000000000000000000000
ans = 0011111111100001010010100010100000001100000000000000000000000000
ans = 0011111111100001010010100010100000010000000000000000000000000000
ans = 0011111111100001010010100010100000010000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010100000000000000000000000000000000000
ans = 0011111111100001010010100010000000000000000000000000000000000000
ans = 0011111111100001010010100010000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010100000000000000000000000000000000000000000
ans = 0011111111100001010010000000000000000000000000000000000000000000
ans = 0011111111100001010010000000000000000000000000000000000000000000
ans = 0011111111100001010000000000000000000000000000000000000000000000
ans = 0011111111100001010000000000000000000000000000000000000000000000
ans = 0011111111100001010000000000000000000000000000000000000000000000
ans = 0011111111100001000000000000000000000000000000000000000000000000
ans = 0011111111100001000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0011111111100000000000000000000000000000000000000000000000000000
ans = 0000000000000000000000000000000000000000000000000000000000000000
• The output above approximates the derivative of sin(x) at x = 1; in the original slides:
  – green digits showed accuracy, while
  – red digits showed loss of precision
>> for i=1:53
     h = 2^-i;
     (sin(1 + h) - sin(1))/h
   end
• Later in this course, we will find a formula which will approximate the derivative of sin(x) at x = 1 using h = 0.001 by
0.540302305868125
which is significantly closer tocos(1) = 0.540302305868140 than any approximation we saw before
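The slides have not yet said what that formula is; as an illustration of the idea, even the simple centered difference (an assumption here, not necessarily the course's formula) recovers several of the digits the forward difference loses:

```python
import math

h = 0.001
exact = math.cos(1.0)

# Forward difference: O(h) truncation error.
forward = (math.sin(1 + h) - math.sin(1)) / h
# Centered difference: O(h^2) truncation error.
central = (math.sin(1 + h) - math.sin(1 - h)) / (2 * h)

print(abs(forward - exact))   # about 4e-4
print(abs(central - exact))   # about 9e-8
```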
• Thus, we cannot simply use the formulae covered in calculus to compute values numerically
• We will now see how an algebraic formula you learned in high school can also fail:
  – the quadratic formula
    x = (–b ± √(b^2 – 4ac)) / (2a)
• Rather than using doubles, we will use our six-digit floating-point numbers to show how the quadratic formula can fail
• Suppose we wish to find the smaller root of the quadratic equation
0.05231x^2 + 7.539x + 0.1094 = 0
• This equation has roots at
x = –144.1070702, x = –0.01451266977
• Using four decimal-digits of precision for each calculation, we find that our approximation to the smaller of the two roots is x = –0.009560
• The relative error of this approximation is 0.3411, or 34%
• Approximating the larger of the two roots, we get x = –144.2
• The relative error of this approximation is only 0.0006449, or 0.0645%
• Why does one formula work so well while the other fails so miserably?
• Stepping through the calculation:
  b = 7.539
  b^2 = 56.84
  4ac = 0.02289
  b^2 – 4ac = 56.82
  √(b^2 – 4ac) = 7.538
  –b + √(b^2 – 4ac) = –0.001000
• The actual value of –b + √(b^2 – 4ac) is –0.0015183155···
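A standard remedy, not given in the slides, is to rationalize the numerator so that no subtraction of nearly equal numbers takes place; a sketch in double precision:

```python
import math

a, b, c = 0.05231, 7.539, 0.1094
d = math.sqrt(b*b - 4*a*c)

# Naive formula: -b + d subtracts two nearly equal numbers.
small_naive = (-b + d) / (2*a)

# Conjugate form: multiply top and bottom by (-b - d);
# the subtraction disappears, so no cancellation occurs.
small_stable = (2*c) / (-b - d)

print(small_naive)
print(small_stable)   # about -0.01451266977
```

In doubles the naive form only loses a few digits here, but with four-digit arithmetic the conjugate form is what rescues the small root.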
Non-Associativity
• Normally, the operations of addition and multiplication are associative, that is:
(a + b) + c = a + (b + c)
(ab)c = a(bc)
• Unfortunately, floating-point numbers are not associative
• If we add a large number to a small number, the large number dominates:
5592. + 0.5923 = 5593.
• Consider the example
  0.005312 + 54.73 – 54.39
• If we calculate the first sum first:
  (0.005312 + 54.73) – 54.39 = 54.74 – 54.39 = 0.35
• If we calculate the second sum first:
  0.005312 + (54.73 – 54.39) = 0.005312 + 0.34 = 0.3453
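The same effect survives in full double precision; a classic Python sketch:

```python
# Floating-point addition is not associative even with 53 bits of precision.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)

print(left)           # 0.6000000000000001
print(right)          # 0.6
print(left == right)  # False: (a + b) + c != a + (b + c)
```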
Order of Operations
• Consider calculating the following sum in Matlab:
  Σ_{n=1}^{1000000} 1/n
• The correct answer, to 20 decimal digits of precision, is
  14.392726722865723632
• Adding the numbers in the natural order, from 1 to 106, we get the following result:
14.3927267228648
• Adding the numbers in the reverse order, we get the result
14.3927267228658
• The second result is off only in the last digit shown (and only by 0.76 units in that place)
• To see why this happens, consider a decimal floating-point model which stores only four decimal digits of precision:
52.37 + 0.004291 + 0.0009023
• Adding from left to right, we get: (52.37 + 0.004291) + 0.0009023
= 52.37 + 0.0009023
= 52.37
• Adding the expression from right to left, we get:
52.37 + (0.004291 + 0.0009023)
= 52.37 + 0.005193
= 52.38
• This second value has a smaller relative error when compared to the correct answer (if we keep all precision) of 52.3751933
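The harmonic-sum experiment above can be reproduced in Python (the same IEEE doubles, so the same rounding behaviour is expected); math.fsum serves as the correctly rounded reference:

```python
import math

terms = [1.0 / n for n in range(1, 10**6 + 1)]

forward = 0.0
for t in terms:            # largest terms first
    forward += t

backward = 0.0
for t in reversed(terms):  # smallest terms first
    backward += t

reference = math.fsum(terms)   # correctly rounded sum of the terms
print(abs(forward - reference))
print(abs(backward - reference))  # smaller: adding small-to-large loses less
```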
Usage Notes
• These slides are made publicly available on the web for anyone to use
• If you choose to use them, or a part thereof, for a course at another institution, I ask only three things:– that you inform me that you are using the slides,
– that you acknowledge my work, and
– that you alert me of any mistakes which I made or changes which you make, and allow me the option of incorporating such changes (with an acknowledgment) in my set of slides
Sincerely,
Douglas Wilhelm Harder, MMath