TRANSCRIPT
Information Representation (Level ISA3)
Floating point numbers
Floating point storage
• Two’s complement notation can’t be used to represent floating point numbers because there is no provision for the radix point
• A form of scientific notation is used
Floating point storage
• Since the radix is 2, we use a binary point instead of a decimal point
• The bits allocated for storing the number are broken into 3 parts: a sign bit, an exponent, and a mantissa (aka significand)
Floating point storage
• The number of bits used for the exponent and the significand depends on what we want to optimize for:
– For greater range of magnitude, we would allocate more bits to the exponent
– For greater precision, we would allocate more bits to the significand
• For the next several slides, we’ll use a 14-bit model with a sign bit, a 5-bit exponent, and an 8-bit significand
Example
• Suppose we want to store the number 17 using our 14-bit model:
– Using decimal scientific notation, 17 is:
17.0 x 10^0 or 1.7 x 10^1 or .17 x 10^2
– In binary, 17 is 10001, which is:
10001.0 x 2^0 or 1000.1 x 2^1 or 100.01 x 2^2 or, finally, .10001 x 2^5
– We will use this last value for our model:
0 00101 10001000
sign exponent significand
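The layout above can be sketched in a few lines of Python (a hypothetical helper, not from the text; at this point no exponent bias is applied and the leading 1 of the significand is stored explicitly):

```python
def encode_14bit(sign, exponent, significand):
    """Pack the three fields of the slides' 14-bit model into a display string:
    1 sign bit | 5 exponent bits | 8 significand bits."""
    return f"{sign:01b} {exponent:05b} {significand:08b}"

# 17 = 10001.0 binary = .10001 x 2^5: exponent 5, significand bits 10001000
print(encode_14bit(0, 5, 0b10001000))   # 0 00101 10001000

# Decoding: the significand is a pure fraction (.10001000), shifted by the exponent
print((0b10001000 / 256) * 2**5)        # 17.0
```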
Negative exponents
• With the existing model, we can only represent numbers with positive exponents
• Two possible solutions:
– Reserve a sign bit for the exponent (with a resulting loss in range)
– Use a biased exponent; explanation on next slide
Bias values
• With bias values, every integer in a given range is converted into a non-negative integer by adding an offset (the bias) to every value to be stored
• The bias value should be at or near the middle of the range of numbers
– Since we are using 5 bits for the exponent, the range of values is 0 .. 31
– If we choose 16 as our bias value, we can consider every number less than 16 a negative number, every number greater than 16 a positive number, and 16 itself can represent 0
– This is called excess-16 representation, since we have to subtract 16 from the stored value to get the actual value
– Note: exponents of all 0s or all 1s are usually reserved for special numbers (such as 0 or infinity)
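A minimal sketch of the excess-16 conversion described above (helper names are my own):

```python
BIAS = 16  # excess-16: the middle of the 5-bit exponent range 0..31

def to_stored(actual):
    """Bias an actual exponent for storage; all-0s and all-1s fields are reserved."""
    stored = actual + BIAS
    assert 0 < stored < 31, "reserved for special values such as 0 or infinity"
    return stored

def to_actual(stored):
    """Subtract the bias to recover the actual exponent."""
    return stored - BIAS

print(to_stored(5))        # 21 (field 10101), the exponent used when storing 17
print(to_stored(-1))       # 15 (field 01111), a negative actual exponent
print(to_actual(0b10101))  # 5
```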
Examples
Using excess-16 notation, our initial example (decimal 17) becomes the following; the exponent field is now 5 + 16 = 21:
0 10101 10001000
sign exponent significand
If we wanted to store 0.25 = binary .1 x 2^-1, the exponent field is 16 - 1 = 15, and we would have:
0 01111 10000000
sign exponent significand
Synonymous forms
• There is still one problem with this representation scheme; there are several synonymous forms for a particular value
• For example, all of the following could represent 17:
0 10101 10001000
sign exponent significand
0 10110 01000100
sign exponent significand
0 10111 00100010
sign exponent significand
0 11000 00010001
sign exponent significand
Normalization
• In order to avoid the potential chaos of synonymous forms, a convention has been established that the leftmost bit of the significand will always be 1
• An additional advantage of this convention is that, since we always know the 1 is supposed to be present, we never need to actually store it; thus we get an extra bit of precision in the significand. Your text refers to this as the hidden bit
Example
Express 0.03125 in normalized floating-point form with excess-16 bias
0.03125 = 0.00001 (binary) x 2^0 = 1.0 x 2^-5
Applying the bias, the exponent field is 16 - 5 = 11
0 01011 00000000
sign exponent significand
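The normalize-then-drop-the-hidden-bit procedure can be sketched as follows (an illustrative helper for positive values only; it reproduces the example above):

```python
BIAS = 16

def encode_normalized(x):
    """Encode positive x as 1.ffffffff x 2^e; the leading 1 (hidden bit) is not stored."""
    e = 0
    while x >= 2.0:   # shift the binary point left until x is in [1, 2)
        x /= 2.0
        e += 1
    while x < 1.0:    # shift it right for values below 1
        x *= 2.0
        e -= 1
    fraction = round((x - 1.0) * 256)   # the 8 stored fraction bits
    return f"0 {e + BIAS:05b} {fraction:08b}"

print(encode_normalized(0.03125))   # 0 01011 00000000
```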
Floating-point arithmetic
• If we wanted to add decimal numbers expressed in scientific notation, we would change one of the numbers so that both of them are expressed in the same power of the base
• Example: 1.5 x 10^2 + 3.5 x 10^3 = .15 x 10^3 + 3.5 x 10^3 = 3.65 x 10^3
Floating-point arithmetic
Example: add the following binary numbers as represented in normalized 14-bit format with a bias of 16:
0 10010 11001000
0 10000 10011010
Aligning the operands around the binary point, we have:
   11.001000
+   0.10011010
---------------
   11.10111010
Renormalizing, retaining the larger exponent and truncating the low-order bit, we have:
0 10010 11101110
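As a cross-check, a small decoder for the 14-bit patterns in this example (a sketch; these patterns store the leading 1 explicitly, so no hidden bit is assumed):

```python
BIAS = 16

def decode(pattern):
    """Decode 'S EEEEE FFFFFFFF': the significand is the pure fraction .FFFFFFFF."""
    s, e, f = pattern.split()
    sign = -1 if s == '1' else 1
    return sign * (int(f, 2) / 256) * 2 ** (int(e, 2) - BIAS)

a = decode('0 10010 11001000')   # .11001000 x 2^2 = 3.125   (11.001 binary)
b = decode('0 10000 10011010')   # .10011010 x 2^0 = 0.6015625
print(a + b)                     # 3.7265625 = 11.10111010 binary

# The stored sum truncates the low-order bit:
print(decode('0 10010 11101110'))   # 3.71875 -- a truncation error of 0.0078125
```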
Floating-point errors
• Any two real numbers have an infinite number of values between them
• Computers have a finite amount of memory in which to represent real numbers
• Modeling an infinite system using a finite system means the model is only an approximation
– The more bits used, the more accurate the approximation
– There is always some element of error, no matter how many bits are used
Floating-point errors
• Errors can be blatant or subtle (or unnoticed)
– Blatant errors: overflow or underflow; blatant because they cause crashes
– Subtle errors: can lead to wildly erroneous results, but are often hard to detect (may go unnoticed until they cause real problems)
Example
• In our simple model, we can express normalized numbers in the range -.11111111 x 2^15 … +.11111111 x 2^15
– Certain limitations are clear; we know we can’t store 2^-19 or 2^128
– It is not so obvious that we can’t accurately store 128.5:
• The binary equivalent is 10000000.1, which is 9 bits wide
• The low-order bit would be dropped (or rounded into the next bit)
• Either way, we have introduced an error
Error propagation
• The relative error in the previous example can be found by taking the ratio of the absolute value of the error to the true value of the number:
(128.5 - 128) / 128.5 = .0039 (about .39%)
• In a lengthy calculation, such errors can propagate, causing substantial loss of precision
• The next slide illustrates such a situation; it shows the first several iterations of a floating-point product loop. Eventually the error will be 100%, since the product will go to 0.
Error propagation example
Multiplier            Multiplicand          14-bit product        Real product   Error
10000.001 (16.125)    0.11101000 (0.90625)  1110.1001 (14.5625)   14.7784        1.46%
1110.1001 (14.5625)   0.11101000            1101.0011 (13.1875)   13.4483        1.94%
1101.0011 (13.1875)   0.11101000            1011.1111 (11.9375)   12.2380        2.46%
1011.1111 (11.9375)   0.11101000            1010.1101 (10.8125)   11.1366        2.91%
1010.1101 (10.8125)   0.11101000            1001.1100 (9.75)      10.1343        3.79%
1001.1100 (9.75)      0.11101000            1000.1101 (8.8125)    9.2222         4.44%
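The 14-bit product column above can be reproduced by truncating each product to 8 significant bits, as sketched below; note that the printed error percentages differ from the slide's Error column, whose "real product" reference appears to use slightly different values:

```python
import math

def truncate8(v):
    """Keep only the 8 most significant bits of v, truncating the rest."""
    integer_bits = math.floor(math.log2(v)) + 1   # bits left of the binary point
    scale = 2 ** (8 - integer_bits)
    return math.floor(v * scale) / scale

stored = exact = 16.125
for _ in range(6):
    stored = truncate8(stored * 0.90625)   # what the 14-bit model keeps
    exact = exact * 0.90625                # what exact arithmetic would give
    print(f"stored={stored:<8} exact={exact:<10.4f} error={(exact - stored) / exact:.2%}")
```

The stored values 14.5625, 13.1875, 11.9375, 10.8125, 9.75, 8.8125 match the table's 14-bit product column, and the gap to the exact product widens every iteration.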
Special values
• Zero is an example of a special floating-point value
– Can’t be normalized; there is no 1 in its binary version
– Usually represented as all 0s in both the exponent and significand regions
– It is common to have both positive and negative versions of 0 in floating-point representation

[Figure: the real number line with 0 as the only special value (3-bit exponent, 4-bit significand)]
More special values
• Infinity: bit pattern used when a result falls in the overflow region
– Uses an exponent field of all 1s and a significand of all 0s
• Not a number (NaN): used to indicate illegal floating-point operations
– Uses an exponent of all 1s and any nonzero significand
• Use of these bit patterns for special values restricts the range of numbers that can be represented
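These bit patterns can be demonstrated concretely with IEEE 754 single precision (the 32-bit format covered later in these slides), building floats from raw bits:

```python
import math
import struct

def f32_from_bits(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack('>f', struct.pack('>I', bits))[0]

# Exponent all 1s, significand all 0s -> infinity
print(f32_from_bits(0b0_11111111_00000000000000000000000))              # inf
# Exponent all 1s, nonzero significand -> NaN
print(math.isnan(f32_from_bits(0b0_11111111_00000000000000000000001)))  # True
# Sign bit set, everything else 0 -> negative zero
print(f32_from_bits(0b1_00000000_00000000000000000000000))              # -0.0
```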
Denormalized numbers
• There is no special value in the underflow region analogous to infinity in the overflow region
• Denormalized numbers exist in the underflow region, shortening the gap between the smallest positive values and 0
Denormalized numbers
• When the exponent field of a number is all 0s and the significand contains at least a single 1, special rules apply to the representation:
– The hidden bit is assumed to be 0 instead of 1
– The exponent is stored in excess n-1 (where excess n is used for normalized numbers)
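Python's double-precision floats behave exactly this way; a sketch dividing the smallest normalized double down through the denormalized range:

```python
import sys

smallest_normal = sys.float_info.min   # 2^-1022, hidden bit still 1
print(smallest_normal)                 # 2.2250738585072014e-308

# Dividing further does not jump straight to 0: denormalized values
# (hidden bit 0, exponent field all 0s) fill in the underflow gap.
smallest_denormal = smallest_normal / 2**52
print(smallest_denormal)               # 5e-324
print(smallest_denormal / 2)           # 0.0 -- finally underflows
```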
Floating point storage and C
• In the C programming language, the standard output function, printf(), prints floating-point numbers with a default precision of 6 (7 digits, including the whole part, using either fixed or scientific notation)
• We can reason backwards from this fun fact to discover C’s default size for floating point numbers
Floating point values and C
• To represent some number n in binary, we need about log2(n) bits; we can relate this to the number of decimal digits:
– Let L10 = log10(n); then 10^L10 = n
– So log2(10^L10) = log2(n),
– and L10 * log2(10) = log2(n),
– which means log2(n) = log10(n) * log2(10)
– The last factor (log2(10)) is a constant, about 3.322
• So, the number of bits needed to represent a decimal number with 7 significant digits is 3.322 * 7, or 23.254; 24 bits, in other words
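The same arithmetic, checked in Python with the unrounded constant:

```python
import math

bits_per_digit = math.log2(10)   # about 3.3219 (the slides round to 3.322)
bits = bits_per_digit * 7        # bits needed for 7 significant decimal digits
print(bits)                      # about 23.25
print(math.ceil(bits))           # 24
```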
The IEEE 754 floating point standard
• Specifies a uniform standard for single- and double-precision floating-point numbers
• Single-precision: 32 bits
– 23-bit significand
– 8-bit exponent with excess-127 bias
• Double-precision: 64 bits
– 52-bit significand
– 11-bit exponent with excess-1023 bias
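A sketch that extracts the three single-precision fields from a value, packing a Python float to 32 bits via the struct module:

```python
import struct

def fields32(x):
    """Pack x as IEEE 754 single precision and split out its three fields."""
    (bits,) = struct.unpack('>I', struct.pack('>f', x))
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127   # remove the excess-127 bias
    significand = bits & 0x7FFFFF            # 23 bits; hidden leading 1 not stored
    return sign, exponent, significand

print(fields32(17.0))    # (0, 4, 524288): 17 = +1.0001 x 2^4
print(fields32(-0.25))   # (1, -2, 0):    -0.25 = -1.0 x 2^-2
```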
Binary-coded decimal representation
• Numeric coding system used mainly in IBM mainframe & midrange systems
• BCD encodes each digit of a decimal number to a 4-bit binary form
• In an 8-bit byte, only the lower nibble represents the digit’s value; the upper nibble (called the zone) is used to hold the sign, which can be positive (1100), negative (1101), or unsigned (1111)
BCD
• Commonly used in banking and other financial settings where:
– Most data is monetary
– There is a high level of I/O activity
• BCD is easier to convert to decimal for printed reports
• Circuitry for BCD arithmetic operations is usually slower than for unsigned binary
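A sketch of the zoned-decimal layout described above (zone values follow the slide: 1111 unsigned, 1100 positive, 1101 negative; real mainframe systems add further conventions on top):

```python
def zoned_decimal(n):
    """One byte per decimal digit: lower nibble = digit value, upper nibble = zone.
    The sign zone is carried in the final byte."""
    sign_zone = 0b1100 if n >= 0 else 0b1101
    digits = [int(d) for d in str(abs(n))]
    data = [(0b1111 << 4) | d for d in digits]   # unsigned zone on leading digits
    data[-1] = (sign_zone << 4) | digits[-1]     # sign replaces the last zone
    return ' '.join(f'{byte:08b}' for byte in data)

print(zoned_decimal(42))   # 11110100 11000010
print(zoned_decimal(-7))   # 11010111
```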