TRANSCRIPT
Information Representation (Level ISA3)
Floating point numbers
Floating point storage
• Two’s complement notation can’t be used to represent floating point numbers because there is no provision for the radix point
• A form of scientific notation is used
Floating point storage
• Since the radix is 2, we use a binary point instead of a decimal point
• The bits allocated for storing the number are broken into 3 parts: a sign bit, an exponent, and a mantissa (aka significand)
Floating point storage
• The number of bits used for the exponent and the significand depends on what we want to optimize for:
– For greater range of magnitude, we would allocate more bits to the exponent
– For greater precision, we would allocate more bits to the significand
• For the next several slides, we’ll use a 14-bit model with a sign bit, a 5-bit exponent, and an 8-bit significand
Example
• Suppose we want to store the number 17 using our 14-bit model:
– Using decimal scientific notation, 17 is:
17.0 x 10^0 or 1.7 x 10^1 or .17 x 10^2
– In binary, 17 is 10001, which is:
10001.0 x 2^0 or 1000.1 x 2^1 or 100.01 x 2^2 or, finally, .10001 x 2^5
– We will use this last value for our model:
0 00101 10001000
sign exponent significand
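The layout above can be sketched in a few lines of Python (a hypothetical helper, not from the text; at this point no exponent bias is applied and the leading 1 of the significand is stored explicitly):

```python
def encode_14bit(sign, exponent, significand):
    """Pack the three fields of the slides' 14-bit model into a display string:
    1 sign bit | 5 exponent bits | 8 significand bits."""
    return f"{sign:01b} {exponent:05b} {significand:08b}"

# 17 = 10001.0 binary = .10001 x 2^5: exponent 5, significand bits 10001000
print(encode_14bit(0, 5, 0b10001000))   # 0 00101 10001000

# Decoding: the significand is a pure fraction (.10001000), shifted by the exponent
print((0b10001000 / 256) * 2**5)        # 17.0
```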
Negative exponents
• With the existing model, we can only represent numbers with positive exponents
• Two possible solutions:
– Reserve a sign bit for the exponent (with a resulting loss in range)
– Use a biased exponent; explanation on next slide
Bias values
• With bias values, every integer in a given range is converted into a non-negative integer by adding an offset (the bias) to every value to be stored
• The bias value should be at or near the middle of the range of numbers
– Since we are using 5 bits for the exponent, the range of values is 0 .. 31
– If we choose 16 as our bias value, we can consider every number less than 16 a negative number, every number greater than 16 a positive number, and 16 itself can represent 0
– This is called excess-16 representation, since we have to subtract 16 from the stored value to get the actual value
– Note: exponents of all 0s or all 1s are usually reserved for special numbers (such as 0 or infinity)
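A minimal sketch of the excess-16 conversion described above (helper names are my own):

```python
BIAS = 16  # excess-16: the middle of the 5-bit exponent range 0..31

def to_stored(actual):
    """Bias an actual exponent for storage; all-0s and all-1s fields are reserved."""
    stored = actual + BIAS
    assert 0 < stored < 31, "reserved for special values such as 0 or infinity"
    return stored

def to_actual(stored):
    """Subtract the bias to recover the actual exponent."""
    return stored - BIAS

print(to_stored(5))        # 21 (field 10101), the exponent used when storing 17
print(to_stored(-1))       # 15 (field 01111), a negative actual exponent
print(to_actual(0b10101))  # 5
```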
Examples
Using excess-16 notation, our initial example (decimal 17) becomes the following; the exponent field is now 5 + 16 = 21:
0 10101 10001000
sign exponent significand
If we wanted to store 0.25 = binary .1 x 2^-1, the exponent field is 16 - 1 = 15, and we would have:
0 01111 10000000
sign exponent significand
Synonymous forms
• There is still one problem with this representation scheme; there are several synonymous forms for a particular value
• For example, all of the following could represent 17:
0 10101 10001000
sign exponent significand
0 10110 01000100
sign exponent significand
0 10111 00100010
sign exponent significand
0 11000 00010001
sign exponent significand
Normalization
• In order to avoid the potential chaos of synonymous forms, a convention has been established that the leftmost bit of the significand will always be 1
• An additional advantage of this convention is that, since we always know the 1 is supposed to be present, we never need to actually store it; thus we get an extra bit of precision in the significand. Your text refers to this as the hidden bit
Example
Express 0.03125 in normalized floating-point form with excess-16 bias
0.03125 = 0.00001 (binary) x 2^0 = 1.0 x 2^-5
Applying the bias, the exponent field is 16 - 5 = 11
0 01011 00000000
sign exponent significand
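The normalize-then-drop-the-hidden-bit procedure can be sketched as follows (an illustrative helper for positive values only; it reproduces the example above):

```python
BIAS = 16

def encode_normalized(x):
    """Encode positive x as 1.ffffffff x 2^e; the leading 1 (hidden bit) is not stored."""
    e = 0
    while x >= 2.0:   # shift the binary point left until x is in [1, 2)
        x /= 2.0
        e += 1
    while x < 1.0:    # shift it right for values below 1
        x *= 2.0
        e -= 1
    fraction = round((x - 1.0) * 256)   # the 8 stored fraction bits
    return f"0 {e + BIAS:05b} {fraction:08b}"

print(encode_normalized(0.03125))   # 0 01011 00000000
```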
Floating-point arithmetic
• If we wanted to add decimal numbers expressed in scientific notation, we would change one of the numbers so that both of them are expressed in the same power of the base
• Example: 1.5 x 10^2 + 3.5 x 10^3 = .15 x 10^3 + 3.5 x 10^3 = 3.65 x 10^3
Floating-point arithmetic
Example: add the following binary numbers as represented in normalized 14-bit format with a bias of 16:
0 10010 11001000
0 10000 10011010
Aligning the operands around the binary point, we have:
   11.001000
+   0.10011010
---------------
   11.10111010
Renormalizing, retaining the larger exponent and truncating the low-order bit, we have:
0 10010 11101110
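As a cross-check, a small decoder for the 14-bit patterns in this example (a sketch; these patterns store the leading 1 explicitly, so no hidden bit is assumed):

```python
BIAS = 16

def decode(pattern):
    """Decode 'S EEEEE FFFFFFFF': the significand is the pure fraction .FFFFFFFF."""
    s, e, f = pattern.split()
    sign = -1 if s == '1' else 1
    return sign * (int(f, 2) / 256) * 2 ** (int(e, 2) - BIAS)

a = decode('0 10010 11001000')   # .11001000 x 2^2 = 3.125   (11.001 binary)
b = decode('0 10000 10011010')   # .10011010 x 2^0 = 0.6015625
print(a + b)                     # 3.7265625 = 11.10111010 binary

# The stored sum truncates the low-order bit:
print(decode('0 10010 11101110'))   # 3.71875 -- a truncation error of 0.0078125
```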
Floating-point errors
• Any two real numbers have an infinite number of values between them
• Computers have a finite amount of memory in which to represent real numbers
• Modeling an infinite system using a finite system means the model is only an approximation
– The more bits used, the more accurate the approximation
– There is always some element of error, no matter how many bits are used
Floating-point errors
• Errors can be blatant or subtle (or unnoticed)
– Blatant errors: overflow or underflow; blatant because they cause crashes
– Subtle errors: can lead to wildly erroneous results, but are often hard to detect (may go unnoticed until they cause real problems)
Example
• In our simple model, we can express normalized numbers in the range -.11111111 x 2^15 … +.11111111 x 2^15
– Certain limitations are clear; we know we can’t store 2^-19 or 2^128
– It is not so obvious that we can’t accurately store 128.5:
• The binary equivalent is 10000000.1, which is 9 bits wide
• The low-order bit would be dropped (or rounded into the next bit)
• Either way, we have introduced an error
Error propagation
• The relative error in the previous example can be found by taking the ratio of the absolute value of the error to the true value of the number:
(128.5 - 128) / 128.5 = .0039 (about .39%)
• In a lengthy calculation, such errors can propagate, causing substantial loss of precision
• The next slide illustrates such a situation; it shows the first several iterations of a floating-point product loop. Eventually the error will be 100%, since the product will go to 0.
Error propagation example
Multiplier            Multiplicand          14-bit product        Real product   Error
10000.001 (16.125)    0.11101000 (0.90625)  1110.1001 (14.5625)   14.7784        1.46%
1110.1001 (14.5625)   0.11101000            1101.0011 (13.1875)   13.4483        1.94%
1101.0011 (13.1875)   0.11101000            1011.1111 (11.9375)   12.2380        2.46%
1011.1111 (11.9375)   0.11101000            1010.1101 (10.8125)   11.1366        2.91%
1010.1101 (10.8125)   0.11101000            1001.1100 (9.75)      10.1343        3.79%
1001.1100 (9.75)      0.11101000            1000.1101 (8.8125)    9.2222         4.44%
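The 14-bit product column above can be reproduced by truncating each product to 8 significant bits, as sketched below; note that the printed error percentages differ from the slide's Error column, whose "real product" reference appears to use slightly different values:

```python
import math

def truncate8(v):
    """Keep only the 8 most significant bits of v, truncating the rest."""
    integer_bits = math.floor(math.log2(v)) + 1   # bits left of the binary point
    scale = 2 ** (8 - integer_bits)
    return math.floor(v * scale) / scale

stored = exact = 16.125
for _ in range(6):
    stored = truncate8(stored * 0.90625)   # what the 14-bit model keeps
    exact = exact * 0.90625                # what exact arithmetic would give
    print(f"stored={stored:<8} exact={exact:<10.4f} error={(exact - stored) / exact:.2%}")
```

The stored values 14.5625, 13.1875, 11.9375, 10.8125, 9.75, 8.8125 match the table's 14-bit product column, and the gap to the exact product widens every iteration.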
Special values
• Zero is an example of a special floating-point value
– Can’t be normalized; there is no 1 in its binary version
– Usually represented as all 0s in both the exponent and significand regions
– It is common to have both positive and negative versions of 0 in floating-point representation

[Figure: the real number line with 0 as the only special value (3-bit exponent, 4-bit significand)]
More special values
• Infinity: bit pattern used when a result falls in the overflow region
– Uses an exponent field of all 1s and a significand of all 0s
• Not a number (NaN): used to indicate illegal floating-point operations
– Uses an exponent of all 1s and any nonzero significand
• Use of these bit patterns for special values restricts the range of numbers that can be represented
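These bit patterns can be demonstrated concretely with IEEE 754 single precision (the 32-bit format covered later in these slides), building floats from raw bits:

```python
import math
import struct

def f32_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# Exponent all 1s, significand all 0s -> infinity
print(f32_from_bits(0b0_11111111_00000000000000000000000))              # inf
# Exponent all 1s, nonzero significand -> NaN
print(math.isnan(f32_from_bits(0b0_11111111_00000000000000000000001)))  # True
# Sign bit set, everything else 0 -> negative zero
print(f32_from_bits(0b1_00000000_00000000000000000000000))              # -0.0
```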
Denormalized numbers
• There is no special value in the underflow region analogous to infinity in the overflow region
• Denormalized numbers exist in the underflow region, shortening the gap between the smallest positive values and 0
Denormalized numbers
• When the exponent field of a number is all 0s and the significand contains at least a single 1, special rules apply to the representation:
– The hidden bit is assumed to be 0 instead of 1
– The exponent is stored in excess n-1 (where excess n is used for normalized numbers)
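Python's double-precision floats behave exactly this way; a sketch dividing the smallest normalized double down through the denormalized range:

```python
import sys

smallest_normal = sys.float_info.min   # 2^-1022, hidden bit still 1
print(smallest_normal)                 # 2.2250738585072014e-308

# Dividing further does not jump straight to 0: denormalized values
# (hidden bit 0, exponent field all 0s) fill in the underflow gap.
smallest_denormal = smallest_normal / 2**52
print(smallest_denormal)               # 5e-324
print(smallest_denormal / 2)           # 0.0 -- finally underflows
```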
Floating point storage and C
• In the C programming language, the standard output function, printf(), prints floating-point numbers with a default precision of 6 (7 digits, including the whole part, using either fixed or scientific notation)
• We can reason backwards from this fun fact to discover C’s default size for floating point numbers
Floating point values and C
• To represent some number n in binary, we need about log2(n) bits; we can relate this to the number of decimal digits:
– Let L10 = log10(n); then 10^L10 = n
– So log2(10^L10) = log2(n),
– and L10 * log2(10) = log2(n),
– which means log2(n) = log10(n) * log2(10)
– The last factor (log2(10)) is a constant, about 3.322
• So, the number of bits needed to represent a decimal number with 7 significant digits is 3.322 * 7, or 23.254; 24 bits, in other words
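The same arithmetic, checked in Python with the unrounded constant:

```python
import math

bits_per_digit = math.log2(10)   # about 3.3219 (the slides round to 3.322)
bits = bits_per_digit * 7        # bits needed for 7 significant decimal digits
print(bits)                      # about 23.25
print(math.ceil(bits))           # 24
```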
The IEEE 754 floating point standard
• Specifies a uniform standard for single- and double-precision floating-point numbers
• Single-precision: 32 bits
– 23-bit significand
– 8-bit exponent with excess-127 bias
• Double-precision: 64 bits
– 52-bit significand
– 11-bit exponent with excess-1023 bias
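A sketch that extracts the three single-precision fields from a value, packing a Python float to 32 bits via the struct module:

```python
import struct

def fields32(x):
    """Pack x as IEEE 754 single precision and split out its three fields."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the excess-127 bias
    significand = bits & 0x7FFFFF            # 23 bits; hidden leading 1 not stored
    return sign, exponent, significand

print(fields32(17.0))    # (0, 4, 524288): 17 = +1.0001 x 2^4
print(fields32(-0.25))   # (1, -2, 0):    -0.25 = -1.0 x 2^-2
```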
Binary-coded decimal representation
• Numeric coding system used mainly in IBM mainframe & midrange systems
• BCD encodes each digit of a decimal number to a 4-bit binary form
• In an 8-bit byte, only the lower nibble represents the digit’s value; the upper nibble (called the zone) is used to hold the sign, which can be positive (1100), negative (1101), or unsigned (1111)
BCD
• Commonly used in banking and other financial settings where:
– Most data is monetary
– There is a high level of I/O activity
• BCD is easier to convert to decimal for printed reports
• Circuitry for BCD arithmetic operations is usually slower than for unsigned binary
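A sketch of the zoned-decimal layout described above (zone values follow the slide: 1111 unsigned, 1100 positive, 1101 negative; real mainframe systems add further conventions on top):

```python
def zoned_decimal(n):
    """One byte per decimal digit: lower nibble = digit value, upper nibble = zone.
    The sign zone is carried in the final byte."""
    sign_zone = 0b1100 if n >= 0 else 0b1101
    digits = [int(d) for d in str(abs(n))]
    data = [(0b1111 << 4) | d for d in digits]   # unsigned zone on leading digits
    data[-1] = (sign_zone << 4) | digits[-1]     # sign replaces the last zone
    return ' '.join(f'{byte:08b}' for byte in data)

print(zoned_decimal(42))   # 11110100 11000010
print(zoned_decimal(-7))   # 11010111
```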