

2550 Garcia Avenue, Mountain View, CA 94043 U.S.A.

What Every Computer Scientist Should Know About Floating-Point Arithmetic

Part No: 800-7895-10, Revision A, June 1992


© 1994 Sun Microsystems, Inc. 2550 Garcia Avenue, Mountain View, California 94043-1100 U.S.A.

All rights reserved. This product and related documentation are protected by copyright and distributed under licenses restricting its use, copying, distribution, and decompilation. No part of this product or related documentation may be reproduced in any form by any means without prior written authorization of Sun and its licensors, if any.

Portions of this product may be derived from the UNIX and Berkeley 4.3 BSD systems, licensed from UNIX System Laboratories, Inc., a wholly owned subsidiary of Novell, Inc., and the University of California, respectively. Third-party font software in this product is protected by copyright and licensed from Sun's font suppliers.

RESTRICTED RIGHTS LEGEND: Use, duplication, or disclosure by the United States Government is subject to the restrictions set forth in DFARS 252.227-7013 (c)(1)(ii) and FAR 52.227-19.

The product described in this manual may be protected by one or more U.S. patents, foreign patents, or pending applications.

TRADEMARKS

Sun, the Sun logo, Sun Microsystems, Sun Microsystems Computer Corporation, and Solaris are trademarks or registered trademarks of Sun Microsystems, Inc. in the U.S. and certain other countries. UNIX is a registered trademark of Novell, Inc., in the United States and other countries; X/Open Company, Ltd., is the exclusive licensor of such trademark. OPEN LOOK is a registered trademark of Novell, Inc. PostScript and Display PostScript are trademarks of Adobe Systems, Inc. All other product names mentioned herein are the trademarks of their respective owners.

All SPARC trademarks, including the SCD Compliant Logo, are trademarks or registered trademarks of SPARC International, Inc. SPARCstation, SPARCserver, SPARCengine, SPARCstorage, SPARCware, SPARCcenter, SPARCclassic, SPARCcluster, SPARCdesign, SPARC811, SPARCprinter, UltraSPARC, microSPARC, SPARCworks, and SPARCompiler are licensed exclusively to Sun Microsystems, Inc. Products bearing SPARC trademarks are based upon an architecture developed by Sun Microsystems, Inc.

The OPEN LOOK and Sun Graphical User Interfaces were developed by Sun Microsystems, Inc. for its users and licensees. Sun acknowledges the pioneering efforts of Xerox in researching and developing the concept of visual or graphical user interfaces for the computer industry. Sun holds a non-exclusive license from Xerox to the Xerox Graphical User Interface, which license also covers Sun's licensees who implement OPEN LOOK GUIs and otherwise comply with Sun's written license agreements.

X Window System is a product of the Massachusetts Institute of Technology.

THIS PUBLICATION IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT.

THIS PUBLICATION COULD INCLUDE TECHNICAL INACCURACIES OR TYPOGRAPHICAL ERRORS. CHANGES ARE PERIODICALLY ADDED TO THE INFORMATION HEREIN; THESE CHANGES WILL BE INCORPORATED IN NEW EDITIONS OF THE PUBLICATION. SUN MICROSYSTEMS, INC. MAY MAKE IMPROVEMENTS AND/OR CHANGES IN THE PRODUCT(S) AND/OR THE PROGRAM(S) DESCRIBED IN THIS PUBLICATION AT ANY TIME.


Contents

Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Rounding Error . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
    Floating-point Formats . . . . . . . . . . . . . . . . . . . . . 3
    Relative Error and Ulps . . . . . . . . . . . . . . . . . . . . 5
    Guard Digits . . . . . . . . . . . . . . . . . . . . . . . . . . 6
    Cancellation . . . . . . . . . . . . . . . . . . . . . . . . . . 8
The IEEE Standard . . . . . . . . . . . . . . . . . . . . . . . . . 17
    Formats and Operations . . . . . . . . . . . . . . . . . . . . . 18
    Special Quantities . . . . . . . . . . . . . . . . . . . . . . . 24
        NaNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
    Exceptions, Flags and Trap Handlers . . . . . . . . . . . . . . 32
Systems Aspects . . . . . . . . . . . . . . . . . . . . . . . . . . 37
    Instruction Sets . . . . . . . . . . . . . . . . . . . . . . . . 38
    Languages and Compilers . . . . . . . . . . . . . . . . . . . . 40


    Exception Handling . . . . . . . . . . . . . . . . . . . . . . . 48
The Details . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
    Rounding Error . . . . . . . . . . . . . . . . . . . . . . . . . 50
    Errors In Summation . . . . . . . . . . . . . . . . . . . . . . 61
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . 63
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Theorem 14 and Theorem 8 . . . . . . . . . . . . . . . . . . . . . . 65


What Every Computer Scientist Should Know About Floating-Point Arithmetic

Note: This document is a reprint of the paper "What Every Computer Scientist Should Know About Floating-Point Arithmetic", published in the March, 1991 issue of Computing Surveys. Copyright © 1991, Association for Computing Machinery, Inc., reprinted by permission.

    Abstract

Floating-point arithmetic is considered an esoteric subject by many people. This is rather surprising because floating-point is ubiquitous in computer systems. Almost every language has a floating-point datatype; computers from PCs to supercomputers have floating-point accelerators; most compilers will be called upon to compile floating-point algorithms from time to time; and virtually every operating system must respond to floating-point exceptions such as overflow. This paper presents a tutorial on those aspects of floating-point that have a direct impact on designers of computer systems. It begins with background on floating-point representation and rounding error, continues with a discussion of the IEEE floating-point standard, and concludes with numerous examples of how computer builders can better support floating-point.

Categories and Subject Descriptors: (Primary) C.0 [Computer Systems Organization]: General – instruction set design; D.3.4 [Programming Languages]: Processors – compilers, optimization; G.1.0 [Numerical Analysis]: General – computer arithmetic, error analysis, numerical algorithms (Secondary)


D.2.1 [Software Engineering]: Requirements/Specifications – languages; D.3.4 [Programming Languages]: Formal Definitions and Theory – semantics; D.4.1 [Operating Systems]: Process Management – synchronization.

General Terms: Algorithms, Design, Languages

Additional Key Words and Phrases: Denormalized number, exception, floating-point, floating-point standard, gradual underflow, guard digit, NaN, overflow, relative error, rounding error, rounding mode, ulp, underflow.

    Introduction

Builders of computer systems often need information about floating-point arithmetic. There are, however, remarkably few sources of detailed information about it. One of the few books on the subject, Floating-Point Computation by Pat Sterbenz, is long out of print. This paper is a tutorial on those aspects of floating-point arithmetic (floating-point hereafter) that have a direct connection to systems building. It consists of three loosely connected parts. The first ("Rounding Error" on page 2) discusses the implications of using different rounding strategies for the basic operations of addition, subtraction, multiplication and division. It also contains background information on the two methods of measuring rounding error, ulps and relative error. The second part discusses the IEEE floating-point standard, which is becoming rapidly accepted by commercial hardware manufacturers. Included in the IEEE standard is the rounding method for basic operations. The discussion of the standard draws on the material in "Rounding Error" on page 2. The third part discusses the connections between floating-point and the design of various aspects of computer systems. Topics include instruction set design, optimizing compilers and exception handling.

I have tried to avoid making statements about floating-point without also giving reasons why the statements are true, especially since the justifications involve nothing more complicated than elementary calculus. Those explanations that are not central to the main argument have been grouped into a section called "The Details", so that they can be skipped if desired. In particular, the proofs of many of the theorems appear in this section. The end of each proof is marked with the ∎ symbol; when a proof is not included, the ∎ appears immediately following the statement of the theorem.

    Rounding Error

Squeezing infinitely many real numbers into a finite number of bits requires an approximate representation. Although there are infinitely many integers, in most programs the result of integer computations can be stored in 32 bits. In contrast, given any fixed number of bits, most calculations with real numbers will produce quantities that cannot be exactly represented using that many bits. Therefore the result of a floating-point calculation must often be rounded in order to fit back into its finite representation. This rounding error is the characteristic feature of floating-point computation. "Relative Error and Ulps" on page 5 describes how it is measured.

Since most floating-point calculations have rounding error anyway, does it matter if the basic arithmetic operations introduce a little bit more rounding error than necessary? That question is a main theme throughout this section. "Guard Digits" on page 6 discusses guard digits, a means of reducing the error when subtracting two nearby numbers. Guard digits were considered sufficiently important by IBM that in 1968 it added a guard digit to the double precision format in the System/360 architecture (single precision already had a guard digit), and retrofitted all existing machines in the field. Two examples are given to illustrate the utility of guard digits.

The IEEE standard goes further than just requiring the use of a guard digit. It gives an algorithm for addition, subtraction, multiplication, division and square root, and requires that implementations produce the same result as that algorithm. Thus when a program is moved from one machine to another, the results of the basic operations will be the same in every bit if both machines support the IEEE standard. This greatly simplifies the porting of programs. Other uses of this precise specification are given in "Exactly Rounded Operations" on page 13.

    Floating-point Formats

Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation.¹ Floating-point representations have a base β (which is always assumed to be even) and a precision p. If β = 10 and p = 3 then the number 0.1 is represented as 1.00 × 10⁻¹. If β = 2 and p = 24, then the decimal number 0.1 cannot be represented exactly but is approximately 1.10011001100110011001101 × 2⁻⁴. In general, a floating-point number will be represented as ±d.dd...d × β^e, where d.dd...d is called the significand² and has p digits. More precisely, ±d_0.d_1 d_2 ... d_(p-1) × β^e represents the number

(1)    ±(d_0 + d_1 β^(-1) + ... + d_(p-1) β^(-(p-1))) β^e,    0 ≤ d_i < β

1. Examples of other representations are floating slash and signed logarithm [Matula and Kornerup 1985; Swartzlander and Alexopoulos 1975].

2. This term was introduced by Forsythe and Moler [1967], and has generally replaced the older term mantissa.


The term floating-point number will be used to mean a real number that can be exactly represented in the format under discussion. Two other parameters associated with floating-point representations are the largest and smallest allowable exponents, e_max and e_min. Since there are β^p possible significands, and e_max - e_min + 1 possible exponents, a floating-point number can be encoded in ⌈log₂(e_max - e_min + 1)⌉ + ⌈log₂(β^p)⌉ + 1 bits, where the final +1 is for the sign bit. The precise encoding is not important for now.

There are two reasons why a real number might not be exactly representable as a floating-point number. The most common situation is illustrated by the decimal number 0.1. Although it has a finite decimal representation, in binary it has an infinite repeating representation. Thus when β = 2, the number 0.1 lies strictly between two floating-point numbers and is exactly representable by neither of them. A less common situation is that a real number is out of range, that is, its absolute value is larger than β × β^e_max or smaller than 1.0 × β^e_min. Most of this paper discusses issues due to the first reason. However, numbers that are out of range will be discussed in "Infinity" on page 27 and "Denormalized Numbers" on page 29.

Floating-point representations are not necessarily unique. For example, both 0.01 × 10¹ and 1.00 × 10⁻¹ represent 0.1. If the leading digit is nonzero (d_0 ≠ 0 in equation (1) above), then the representation is said to be normalized. The floating-point number 1.00 × 10⁻¹ is normalized, while 0.01 × 10¹ is not. When β = 2, p = 3, e_min = -1 and e_max = 2 there are 16 normalized floating-point numbers, as shown in Figure 1. The bold hash marks correspond to numbers whose significand is 1.00. Requiring that a floating-point representation be normalized makes the representation unique. Unfortunately, this restriction makes it impossible to represent zero! A natural way to represent 0 is with 1.0 × β^(e_min - 1), since this preserves the fact that the numerical ordering of nonnegative real numbers corresponds to the lexicographic ordering of their floating-point representations.³ When the exponent is stored in a k bit field, that means that only 2^k - 1 values are available for use as exponents, since one must be reserved to represent 0.

Note that the × in a floating-point number is part of the notation, and different from a floating-point multiply operation. The meaning of the × symbol should be clear from the context. For example, the expression (2.5 × 10⁻³) × (4.0 × 10²) involves only a single floating-point multiplication.

3. This assumes the usual arrangement where the exponent is stored to the left of the significand.



Figure 1  Normalized numbers when β = 2, p = 3, e_min = -1, e_max = 2. [The figure, a number line from 0 to 7 with a hash mark at each representable number, is not reproduced here.]
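Since the figure itself is not reproduced, the format is small enough to enumerate by machine. The following sketch (an added illustration, not part of the original paper) lists the 16 normalized values of this toy format:

    # Enumerate the normalized numbers for beta = 2, p = 3, emin = -1, emax = 2.
    beta, p, emin, emax = 2, 3, -1, 2

    values = []
    for e in range(emin, emax + 1):
        # Normalized significands 1.00, 1.01, 1.10, 1.11 are the integers 4..7
        # scaled by 2^(p-1); multiply by beta^(e - p + 1) to place the point.
        for sig in range(beta ** (p - 1), beta ** p):
            values.append(sig * float(beta) ** (e - p + 1))

    print(len(values))      # 16
    print(sorted(values))   # 0.5, 0.625, 0.75, 0.875, 1.0, 1.25, ..., 6.0, 7.0

The smallest value is 1.00 × 2⁻¹ = 0.5 and the largest is 1.11 × 2² = 7, matching the 0 to 7 span of Figure 1.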

    Relative Error and Ulps

Since rounding error is inherent in floating-point computation, it is important to have a way to measure this error. Consider the floating-point format with β = 10 and p = 3, which will be used throughout this section. If the result of a floating-point computation is 3.12 × 10⁻², and the answer when computed to infinite precision is .0314, it is clear that this is in error by 2 units in the last place. Similarly, if the real number .0314159 is represented as 3.14 × 10⁻², then it is in error by .159 units in the last place. In general, if the floating-point number d.dd...d × β^e is used to represent z, then it is in error by |d.dd...d - (z/β^e)| β^(p-1) units in the last place.⁴ ⁵ The term ulps will be used as shorthand for "units in the last place." If the result of a calculation is the floating-point number nearest to the correct result, it still might be in error by as much as .5 ulp. Another way to measure the difference between a floating-point number and the real number it is approximating is relative error, which is simply the difference between the two numbers divided by the real number. For example the relative error committed when approximating 3.14159 by 3.14 × 10⁰ is .00159/3.14159 ≈ .0005.
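Both measures are mechanical to compute. A short sketch (added here for illustration; the function name is ours) checks the numbers just quoted for β = 10, p = 3:

    # Error, in units in the last place, of the p-digit approximation
    # d.dd... x 10^e to a real number z:  |d.dd... - z/10^e| * 10^(p-1).
    def ulps_error(sig, e, z, p=3):
        return abs(sig - z / 10.0 ** e) * 10.0 ** (p - 1)

    print(ulps_error(3.12, -2, .0314))      # ~2.0   ulps
    print(ulps_error(3.14, -2, .0314159))   # ~0.159 ulps
    print(abs(3.14 - 3.14159) / 3.14159)    # relative error ~.000506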

To compute the relative error that corresponds to .5 ulp, observe that when a real number is approximated by the closest possible floating-point number d.dd...dd × β^e, the error can be as large as 0.00...00β′ × β^e, where β′ denotes the digit β/2, there are p units in the significand of the floating-point number, and p units of 0 in the significand of the error. This error is ((β/2)β^(-p)) × β^e. Since numbers of the form d.dd...dd × β^e all have the same absolute error, but have values that range between β^e and β × β^e, the relative error ranges between ((β/2)β^(-p)) × β^e/β^e and ((β/2)β^(-p)) × β^e/β^(e+1). That is,

(2)    (1/2)β^(-p) ≤ (1/2) ulp ≤ (β/2)β^(-p)

4. Unless the number z is larger than β^(e_max + 1) or smaller than β^e_min. Numbers which are out of range in this fashion will not be considered until further notice.

5. Let z′ be the floating-point number that approximates z. Then |d.dd...d - (z/β^e)| β^(p-1) is equivalent to |z′ - z|/ulp(z′). (See Numerical Computation Guide for the definition of ulp(z)). A more accurate formula for measuring error is |z′ - z|/ulp(z). -- Ed.



In particular, the relative error corresponding to .5 ulp can vary by a factor of β. This factor is called the wobble. Setting ε = (β/2)β^(-p) to the largest of the bounds in (2) above, we can say that when a real number is rounded to the closest floating-point number, the relative error is always bounded by ε, which is referred to as machine epsilon.

In the example above, the relative error was .00159/3.14159 ≈ .0005. In order to avoid such small numbers, the relative error is normally written as a factor times ε, which in this case is ε = (β/2)β^(-p) = 5(10)⁻³ = .005. Thus the relative error would be expressed as ((.00159/3.14159)/.005)ε ≈ 0.1ε.

To illustrate the difference between ulps and relative error, consider the real number x = 12.35. It is approximated by x̃ = 1.24 × 10¹. The error is 0.5 ulps, the relative error is 0.8ε. Next consider the computation 8x̃. The exact value is 8x = 98.8, while the computed value is 8x̃ = 9.92 × 10¹. The error is now 4.0 ulps, but the relative error is still 0.8ε. The error measured in ulps is 8 times larger, even though the relative error is the same. In general, when the base is β, a fixed relative error expressed in ulps can wobble by a factor of up to β. And conversely, as equation (2) above shows, a fixed error of .5 ulps results in a relative error that can wobble by β.

The most natural way to measure rounding error is in ulps. For example rounding to the nearest floating-point number corresponds to an error of less than or equal to .5 ulp. However, when analyzing the rounding error caused by various formulas, relative error is a better measure. A good illustration of this is the analysis on page 51. Since ε can overestimate the effect of rounding to the nearest floating-point number by the wobble factor of β, error estimates of formulas will be tighter on machines with a small β.

When only the order of magnitude of rounding error is of interest, ulps and ε may be used interchangeably, since they differ by at most a factor of β. For example, when a floating-point number is in error by n ulps, that means that the number of contaminated digits is log_β n. If the relative error in a computation is nε, then

(3)    contaminated digits ≈ log_β n

    Guard Digits

One method of computing the difference between two floating-point numbers is to compute the difference exactly and then round it to the nearest floating-point number. This is very expensive if the operands differ greatly in size. Assuming p = 3, 2.15 × 10¹² - 1.25 × 10⁻⁵ would be calculated as


    x     = 2.15 × 10¹²
    y     =  .0000000000000000125 × 10¹²
    x - y = 2.1499999999999999875 × 10¹²

which rounds to 2.15 × 10¹². Rather than using all these digits, floating-point hardware normally operates on a fixed number of digits. Suppose that the number of digits kept is p, and that when the smaller operand is shifted right, digits are simply discarded (as opposed to rounding). Then 2.15 × 10¹² - 1.25 × 10⁻⁵ becomes

    x     = 2.15 × 10¹²
    y     = 0.00 × 10¹²
    x - y = 2.15 × 10¹²

The answer is exactly the same as if the difference had been computed exactly and then rounded. Take another example: 10.1 - 9.93. This becomes

    x     = 1.01 × 10¹
    y     = 0.99 × 10¹
    x - y =  .02 × 10¹

The correct answer is .17, so the computed difference is off by 30 ulps and is wrong in every digit! How bad can the error be?

Theorem 1

Using a floating-point format with parameters β and p, and computing differences using p digits, the relative error of the result can be as large as β - 1.

Proof

A relative error of β - 1 in the expression x - y occurs when x = 1.00...0 and y = .ρρ...ρ, where ρ = β - 1. Here y has p digits (all equal to ρ). The exact difference is x - y = β^(-p). However, when computing the answer using only p digits, the rightmost digit of y gets shifted off, and so the computed difference is β^(-p+1). Thus the error is β^(-p) - β^(-p+1) = β^(-p)(β - 1), and the relative error is β^(-p)(β - 1)/β^(-p) = β - 1. ∎

When β = 2, the relative error can be as large as the result, and when β = 10, it can be 9 times larger. Or to put it another way, when β = 2, equation (3) above shows that the number of contaminated digits is log₂(1/ε) = log₂(2^p) = p. That is, all of the p digits in the result are wrong! Suppose that one extra digit is added to guard against this situation (a guard digit). That is, the smaller number is truncated to p + 1 digits, and then the result of the subtraction is rounded to p digits. With a guard digit, the previous example becomes


    x     = 1.010 × 10¹
    y     = 0.993 × 10¹
    x - y =  .017 × 10¹

and the answer is exact. With a single guard digit, the relative error of the result may be greater than ε, as in 110 - 8.59.

    x     = 1.10  × 10²
    y     =  .085 × 10²
    x - y = 1.015 × 10²

This rounds to 102, compared with the correct answer of 101.41, for a relative error of .006, which is greater than ε = .005. In general, the relative error of the result can be only slightly larger than ε. More precisely,

Theorem 2

If x and y are floating-point numbers in a format with parameters β and p, and if subtraction is done with p + 1 digits (i.e. one guard digit), then the relative rounding error in the result is less than 2ε.

This theorem will be proven in "Rounding Error" on page 50. Addition is included in the above theorem since x and y can be positive or negative.
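Theorems 1 and 2 are easy to observe experimentally. Here is a small simulation of β = 10, p = 3 subtraction (an added illustration; it is one plausible model of the shift-and-discard hardware described above, not code from the paper), built on Python's decimal module:

    from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

    def subtract(x, y, p=3, guard=0):
        # Align y to x's exponent keeping p + guard digits; shifted-off
        # digits are simply discarded (ROUND_DOWN), as in the text.
        x, y = Decimal(x), Decimal(y)
        quantum = Decimal(1).scaleb(x.adjusted() - (p - 1) - guard)
        diff = x - y.quantize(quantum, rounding=ROUND_DOWN)
        # Round the result back to p significant digits.
        return diff.quantize(Decimal(1).scaleb(diff.adjusted() - (p - 1)),
                             rounding=ROUND_HALF_EVEN)

    print(subtract("10.1", "9.93"))           # 0.200: off by 30 ulps
    print(subtract("10.1", "9.93", guard=1))  # 0.170: exact
    print(subtract("1.00", ".999"))           # 0.0100 vs true .001,
                                              # relative error 9 = beta - 1 (Theorem 1)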

    Cancellation

The last section can be summarized by saying that without a guard digit, the relative error committed when subtracting two nearby quantities can be very large. In other words, the evaluation of any expression containing a subtraction (or an addition of quantities with opposite signs) could result in a relative error so large that all the digits are meaningless (Theorem 1). When subtracting nearby quantities, the most significant digits in the operands match and cancel each other. There are two kinds of cancellation: catastrophic and benign.

Catastrophic cancellation occurs when the operands are subject to rounding errors. For example in the quadratic formula, the expression b² - 4ac occurs. The quantities b² and 4ac are subject to rounding errors since they are the results of floating-point multiplications. Suppose that they are rounded to the nearest floating-point number, and so are accurate to within .5 ulp. When they are subtracted, cancellation can cause many of the accurate digits to disappear, leaving behind mainly digits contaminated by rounding error. Hence the difference might have an error of many ulps. For example, consider b = 3.34, a = 1.22, and c = 2.28. The exact value of b² - 4ac is .0292. But b² rounds to 11.2 and 4ac rounds to 11.1, hence the final answer is .1 which is an error by 70


ulps,⁶ even though 11.2 - 11.1 is exactly equal to .1. The subtraction did not introduce any error, but rather exposed the error introduced in the earlier multiplications.

Benign cancellation occurs when subtracting exactly known quantities. If x and y have no rounding error, then by Theorem 2 if the subtraction is done with a guard digit, the difference x - y has a very small relative error (less than 2ε).

A formula that exhibits catastrophic cancellation can sometimes be rearranged to eliminate the problem. Again consider the quadratic formula

(4)    r_1 = (-b + √(b² - 4ac)) / 2a,    r_2 = (-b - √(b² - 4ac)) / 2a

When b² ≫ ac, then b² - 4ac does not involve a cancellation and √(b² - 4ac) ≈ |b|. But the other addition (subtraction) in one of the formulas will have a catastrophic cancellation. To avoid this, multiply the numerator and denominator of r_1 by -b - √(b² - 4ac) (and similarly for r_2) to obtain

(5)    r_1 = 2c / (-b - √(b² - 4ac)),    r_2 = 2c / (-b + √(b² - 4ac))

If b² ≫ ac and b > 0, then computing r_1 using formula (4) will involve a cancellation. Therefore, use (5) for computing r_1 and (4) for r_2. On the other hand, if b < 0, use (4) for computing r_1 and (5) for r_2.
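The effect is just as visible in IEEE double precision. In the sketch below (an added illustration with made-up coefficients), b² ≫ 4ac, so formula (4) cancels catastrophically for r_1 while formula (5) does not:

    import math

    a, b, c = 1.0, 1e8, 1.0               # roots are approximately -1e8 and -1e-8
    d = math.sqrt(b * b - 4.0 * a * c)

    r1_eq4 = (-b + d) / (2.0 * a)         # (4): -b + d cancels catastrophically
    r1_eq5 = (2.0 * c) / (-b - d)         # (5): no cancellation for this root

    print(r1_eq4)   # ~-7.45e-09: not even the first digit is right
    print(r1_eq5)   # -1e-08: essentially the correctly rounded root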

The expression x² - y² is another formula that exhibits catastrophic cancellation. It is more accurate to evaluate it as (x - y)(x + y).⁷ Unlike the quadratic formula, this improved form still has a subtraction, but it is a benign cancellation of quantities without rounding error, not a catastrophic one. By Theorem 2, the

6. 700, not 70. Since .1 - .0292 = .0708, the error in terms of ulp(0.0292) is 708 ulps. -- Ed.



relative error in x - y is at most 2ε. The same is true of x + y. Multiplying two quantities with a small relative error results in a product with a small relative error (see "Rounding Error" on page 50).
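A two-line experiment in IEEE double precision (added here; not from the paper) shows the difference. With x = 10⁸ + 1 and y = 10⁸, both x - y and x + y are exact, while x·x must be rounded:

    x, y = 1e8 + 1.0, 1e8

    print(x * x - y * y)      # 200000000.0: x*x is rounded, so the low-order 1 is lost
    print((x - y) * (x + y))  # 200000001.0: exact, since both factors are exact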

In order to avoid confusion between exact and computed values, the following notation is used. Whereas x - y denotes the exact difference of x and y, x ⊖ y denotes the computed difference (i.e., with rounding error). Similarly ⊕, ⊗, and ⊘ denote computed addition, multiplication, and division, respectively. All caps indicate the computed value of a function, as in LN(x) or SQRT(x). Lowercase functions and traditional mathematical notation denote their exact values as in ln(x) and √x.

Although (x ⊖ y) ⊗ (x ⊕ y) is an excellent approximation to x² - y², the floating-point numbers x and y might themselves be approximations to some true quantities x̂ and ŷ. For example, x̂ and ŷ might be exactly known decimal numbers that cannot be expressed exactly in binary. In this case, even though x ⊖ y is a good approximation to x - y, it can have a huge relative error compared to the true expression x̂ - ŷ, and so the advantage of (x + y)(x - y) over x² - y² is not as dramatic. Since computing (x + y)(x - y) is about the same amount of work as computing x² - y², it is clearly the preferred form in this case. In general, however, replacing a catastrophic cancellation by a benign one is not worthwhile if the expense is large, because the input is often (but not always) an approximation. But eliminating a cancellation entirely (as in the quadratic formula) is worthwhile even if the data are not exact. Throughout this paper, it will be assumed that the floating-point inputs to an algorithm are exact and that the results are computed as accurately as possible.

The expression x² - y² is more accurate when rewritten as (x - y)(x + y) because a catastrophic cancellation is replaced with a benign one. We next present more interesting examples of formulas exhibiting catastrophic cancellation that can be rewritten to exhibit only benign cancellation.

The area of a triangle can be expressed directly in terms of the lengths of its sides a, b, and c as

(6)    A = √(s(s - a)(s - b)(s - c)),  where s = (a + b + c)/2

7. Although the expression (x - y)(x + y) does not cause a catastrophic cancellation, it is slightly less accurate than x² - y² if x ≫ y or x ≪ y. In this case, (x - y)(x + y) has three rounding errors, but x² - y² has only two since the rounding error committed when computing the smaller of x² and y² does not affect the final subtraction.



Suppose the triangle is very flat; that is, a ≈ b + c. Then s ≈ a, and the term (s - a) in eq. (6) subtracts two nearby numbers, one of which may have rounding error. For example, if a = 9.0, b = c = 4.53, then the correct value of s is 9.03 and A is 2.342... . Even though the computed value of s (9.05) is in error by only 2 ulps, the computed value of A is 3.04, an error of 70 ulps.

There is a way to rewrite formula (6) so that it will return accurate results even for flat triangles [Kahan 1986]. It is

(7)    A = √((a + (b + c))(c - (a - b))(c + (a - b))(a + (b - c))) / 4,    a ≥ b ≥ c

If a, b and c do not satisfy a ≥ b ≥ c, simply rename them before applying (7). It is straightforward to check that the right-hand sides of (6) and (7) are algebraically identical. Using the values of a, b, and c above gives a computed area of 2.35, which is 1 ulp in error and much more accurate than the first formula.
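Python's decimal module can mimic the β = 10, p = 3 format of this example, since every operation (including sqrt) is rounded to the context precision; this makes it easy to reproduce the numbers above. The sketch is an added illustration, not code from the paper:

    from decimal import Decimal, getcontext

    getcontext().prec = 3                # simulate beta = 10, p = 3
    a, b, c = Decimal("9.0"), Decimal("4.53"), Decimal("4.53")

    s = (a + (b + c)) / 2                # computed s = 9.05 (true s = 9.03)
    heron = (s * (s - a) * (s - b) * (s - c)).sqrt()            # formula (6)
    kahan = ((a + (b + c)) * (c - (a - b)) *
             (c + (a - b)) * (a + (b - c))).sqrt() / 4          # formula (7)

    print(heron)   # 3.04: 70 ulps away from the true area 2.342...
    print(kahan)   # 2.35: 1 ulp off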

Although formula (7) is much more accurate than (6) for this example, it would be nice to know how well (7) performs in general.

Theorem 3

The rounding error incurred when using (7) to compute the area of a triangle is at most 11ε, provided that subtraction is performed with a guard digit, ε ≤ .005, and that square roots are computed to within 1/2 ulp.

The condition that ε < .005 is met in virtually every actual floating-point system. For example when β = 2, p ≥ 8 ensures that ε < .005, and when β = 10, p ≥ 3 is enough.

In statements like Theorem 3 that discuss the relative error of an expression, it is understood that the expression is computed using floating-point arithmetic. In particular, the relative error is actually of the expression

(8)    SQRT((a ⊕ (b ⊕ c)) ⊗ (c ⊖ (a ⊖ b)) ⊗ (c ⊕ (a ⊖ b)) ⊗ (a ⊕ (b ⊖ c))) ⊘ 4

Because of the cumbersome nature of (8), in the statement of theorems we will usually say the "computed value of E" rather than writing out E with circle notation.



Error bounds are usually too pessimistic. In the numerical example given above, the computed value of (7) is 2.35, compared with a true value of 2.34216 for a relative error of 0.7ε, which is much less than 11ε. The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems.

A final example of an expression that can be rewritten to use benign cancellation is (1 + x)^n, where x ≪ 1. This expression arises in financial calculations. Consider depositing $100 every day into a bank account that earns an annual interest rate of 6%, compounded daily. If n = 365 and i = .06, the amount of money accumulated at the end of one year is 100[((1 + i/n)^n - 1)/(i/n)] dollars. If this is computed using β = 2 and p = 24, the result is $37615.45 compared to the exact answer of $37614.05, a discrepancy of $1.40. The reason for the problem is easy to see. The expression 1 + i/n involves adding 1 to .0001643836, so the low order bits of i/n are lost. This rounding error is amplified when 1 + i/n is raised to the nth power.

The troublesome expression (1 + i/n)^n can be rewritten as e^(n ln(1 + i/n)), where now the problem is to compute ln(1 + x) for small x. One approach is to use the approximation ln(1 + x) ≈ x, in which case the payment becomes $37617.26, which is off by $3.21 and even less accurate than the obvious formula. But there is a way to compute ln(1 + x) very accurately, as Theorem 4 shows [Hewlett-Packard 1982]. This formula yields $37614.07, accurate to within two cents!

Theorem 4 assumes that LN(x) approximates ln(x) to within 1/2 ulp. The problem it solves is that when x is small, LN(1 ⊕ x) is not close to ln(1 + x) because 1 ⊕ x has lost the information in the low order bits of x. That is, the computed value of ln(1 + x) is not close to its actual value when x ≪ 1.

Theorem 4

If ln(1 + x) is computed using the formula

    ln(1 + x) = x                               for 1 ⊕ x = 1
              = x ln(1 + x) / ((1 + x) - 1)     for 1 ⊕ x ≠ 1

the relative error is at most 5ε when 0 ≤ x < 3/4, provided subtraction is performed with a guard digit, ε < 0.1, and ln is computed to within 1/2 ulp.
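In IEEE double precision the theorem's formula can be written directly (a sketch added here; math.log1p is the standard library routine that solves the same problem):

    import math

    def ln1p(x):
        # Theorem 4: w = 1 (+) x; if w == 1 return x, else x*ln(w)/(w - 1).
        w = 1.0 + x
        if w == 1.0:
            return x
        return x * math.log(w) / (w - 1.0)

    x = 1e-10                  # true ln(1 + x) = 9.99999999995000...e-11
    print(math.log(1.0 + x))   # ~1.00000008e-10: wrong from the 8th digit on
    print(ln1p(x))             # ~9.9999999999500e-11: accurate
    print(math.log1p(x))       # the library routine agrees with ln1p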



This formula will work for any value of x but is only interesting for x ≪ 1, which is where catastrophic cancellation occurs in the naive formula ln(1 + x). Although the formula may seem mysterious, there is a simple explanation for why it works. Write ln(1 + x) as x(ln(1 + x)/x) = xμ(x). The left hand factor can be computed exactly, but the right hand factor μ(x) = ln(1 + x)/x will suffer a large rounding error when adding 1 to x. However, μ is almost constant, since ln(1 + x) ≈ x. So changing x slightly will not introduce much error. In other words, if x̃ ≈ x, computing xμ(x̃) will be a good approximation to xμ(x) = ln(1 + x). Is there a value for x̃ for which x̃ and x̃ + 1 can be computed accurately? There is; namely x̃ = (1 ⊕ x) ⊖ 1, because then 1 + x̃ is exactly equal to 1 ⊕ x.

The results of this section can be summarized by saying that a guard digit guarantees accuracy when nearby precisely known quantities are subtracted (benign cancellation). Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit. The price of a guard digit is not high, because it merely requires making the adder one bit wider. For a 54 bit double precision adder, the additional cost is less than 2%. For this price, you gain the ability to run many algorithms such as formula (6) for computing the area of a triangle and the expression ln(1 + x). Although most modern computers have a guard digit, there are a few (such as Crays) that do not.

    Exactly Rounded Operations

When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly then rounded to the nearest floating-point number. Operations performed in this manner will be called exactly rounded.⁸ The example immediately preceding Theorem 2 shows that a single guard digit will not always give exactly rounded results. The previous section gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding.

8. Also commonly referred to as correctly rounded. -- Ed.


So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought divides the 10 digits in half, letting {0, 1, 2, 3, 4} round down, and {5, 6, 7, 8, 9} round up; thus 12.5 would round to 13. This is how rounding works on Digital Equipment Corporation's VAX computers. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half. One way of obtaining this 50% behavior is to require that the rounded result have its least significant digit be even. Thus 12.5 rounds to 12 rather than 13 because 2 is even. Which of these methods is best, round up or round to even? Reiser and Knuth [1975] offer the following reason for preferring round to even.

Theorem 5

Let x and y be floating-point numbers, and define x_0 = x, x_1 = (x_0 ⊖ y) ⊕ y, ..., x_n = (x_(n-1) ⊖ y) ⊕ y. If ⊕ and ⊖ are exactly rounded using round to even, then either x_n = x for all n or x_n = x_1 for all n ≥ 1.

To clarify this result, consider β = 10, p = 3 and let x = 1.00, y = -.555. When rounding up, the sequence becomes x_0 ⊖ y = 1.56, x_1 = 1.56 ⊖ .555 = 1.01, x_1 ⊖ y = 1.01 ⊕ .555 = 1.57, and each successive value of x_n increases by .01, until x_n = 9.45 (n = 845).⁹ Under round to even, x_n is always 1.00. This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.

9. When n = 845, x_n = 9.45, x_n + 0.555 = 10.0, and 10.0 - 0.555 = 9.45. Therefore, x_n = x_845 for n > 845.
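The drift is easy to reproduce with Python's decimal module, which lets us pick both the precision and the tie-breaking rule (an added demonstration, not the paper's code):

    from decimal import Decimal, getcontext, ROUND_HALF_UP, ROUND_HALF_EVEN

    getcontext().prec = 3                      # beta = 10, p = 3
    y = Decimal("-0.555")

    for rounding in (ROUND_HALF_UP, ROUND_HALF_EVEN):
        getcontext().rounding = rounding
        x = Decimal("1.00")
        for _ in range(20):
            x = (x - y) + y                    # x_n = (x_{n-1} (-) y) (+) y
        print(rounding, x)
    # ROUND_HALF_UP  : x drifts upward by .01 per step, reaching 1.20 after 20 steps
    # ROUND_HALF_EVEN: x is still 1.00, as Theorem 5 promises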

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language. The second approach represents higher precision floating-point numbers as an array of ordinary floating-point numbers, where adding the elements of the array in infinite precision recovers the high precision floating-point number. It is this second approach that will be discussed here. The advantage of using an array of floating-point numbers is that it can be coded portably in a high level language, but it requires exactly rounded arithmetic.

The key to multiplication in this system is representing a product xy as a sum, where each summand has the same precision as x and y. This can be done by splitting x and y. Writing x = x_h + x_l and y = y_h + y_l, the exact product is xy = x_h y_h + x_h y_l + x_l y_h + x_l y_l. If x and y have p bit significands, the summands will also have p bit significands provided that x_l, x_h, y_h, y_l can be represented using ⌊p/2⌋ bits. When p is even, it is easy to find a splitting. The number x_0.x_1...x_(p-1) can be written as the sum of x_0.x_1...x_(p/2-1) and 0.0...0x_(p/2)...x_(p-1). When p is odd, this simple splitting method won't work. An extra bit can, however, be gained by using negative numbers. For example, if β = 2, p = 5, and x = .10111, x can be split as x_h = .11 and x_l = -.00001. There is more than one way to split a number. A splitting method that is easy to compute is due to Dekker [1971], but it requires more than a single guard digit.

Theorem 6

Let p be the floating-point precision, with the restriction that p is even when β > 2, and assume that floating-point operations are exactly rounded. Then if k = ⌈p/2⌉ is half the precision (rounded up) and m = β^k + 1, x can be split as x = x_h + x_l, where x_h = (m ⊗ x) ⊖ (m ⊗ x ⊖ x), x_l = x ⊖ x_h, and each x_i is representable using ⌊p/2⌋ bits of precision.

To see how this theorem works in an example, let β = 10, p = 4, b = 3.476, a = 3.463, and c = 3.479. Then b² - ac rounded to the nearest floating-point number is .03480, while b ⊗ b = 12.08, a ⊗ c = 12.05, and so the computed value of b² - ac is .03. This is an error of 480 ulps. Using Theorem 6 to write b = 3.5 - .024, a = 3.5 - .037, and c = 3.5 - .021, b² becomes 3.5² - 2 × 3.5 × .024 + .024². Each summand is exact, so b² = 12.25 - .168 + .000576, where the sum is left unevaluated at this point. Similarly, ac = 3.5² - (3.5 × .037 + 3.5 × .021) + .037 × .021 = 12.25 - .2030 + .000777. Finally, subtracting these two series term by term gives an estimate for b² - ac of 0 ⊕ .0350 ⊖ .000201 = .03480, which is identical to the exactly rounded result. To show that Theorem 6 really requires exact rounding, consider p = 3, β = 2, and x = 7. Then m = 5, mx = 35, and m ⊗ x = 32. If subtraction is performed with a single guard digit, then (m ⊗ x) ⊖ x = 28. Therefore, x_h = 4 and x_l = 3, hence x_l is not representable with ⌊p/2⌋ = 1 bit.
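In IEEE double precision (p = 53, so k = 27 and m = 2²⁷ + 1), Theorem 6 is the well-known Veltkamp-Dekker splitting. A minimal sketch (an added illustration; overflow of m ⊗ x for huge x is ignored here):

    def split(x):
        # Theorem 6 with p = 53: m = 2^27 + 1; xh = (m*x) - (m*x - x); xl = x - xh.
        m = 134217729.0                  # 2**27 + 1
        mx = m * x
        xh = mx - (mx - x)
        return xh, x - xh

    xh, xl = split(0.1)
    print(xh + xl == 0.1)                # True: the split is exact
    # xh and xl each fit in 26 significand bits, so all four partial
    # products xh*yh, xh*yl, xl*yh, xl*yl of x*y = (xh+xl)(yh+yl) are exact.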

As a final example of exact rounding, consider dividing m by 10. The result is a floating-point number that will in general not be equal to m/10. When β = 2, multiplying m/10 by 10 will miraculously restore m, provided exact rounding is being used. Actually, a more general fact (due to Kahan) is true. The proof is ingenious, but readers not interested in such details can skip ahead to "The IEEE Standard" on page 17.

Theorem 7

When β = 2, if m and n are integers with |m| < 2^(p-1) and n has the special form n = 2^i + 2^j, then (m ⊘ n) ⊗ n = m, provided floating-point operations are exactly rounded.


Proof

Scaling by a power of two is harmless, since it changes only the exponent, not the significand. If q = m/n, then scale n so that 2^(p-1) ≤ n < 2^p and scale m so that 1/2 < q < 1. Thus, 2^(p-2) < m < 2^p. Since m has p significant bits, it has at most one bit to the right of the binary point. Changing the sign of m is harmless, so assume that q > 0.

If q̄ = m ⊘ n, to prove the theorem requires showing that

(9)    |n q̄ - m| ≤ 1/4

That is because m has at most 1 bit right of the binary point, so n q̄ will round to m. To deal with the halfway case when |n q̄ - m| = 1/4, note that since the initial unscaled m had |m| < 2^(p-1), its low-order bit was 0, so the low-order bit of the scaled m is also 0. Thus, halfway cases will round to m.

Suppose that q = .q_1 q_2 ..., and let q̂ = .q_1 q_2 ... q_p 1. To estimate |n q̄ - m|, first compute |q̂ - q| = |N/2^(p+1) - m/n|, where N is an odd integer. Since n = 2^i + 2^j and 2^(p-1) ≤ n < 2^p, it must be that n = 2^(p-1) + 2^k for some k ≤ p - 2, and thus

    |q̂ - q| = |nN - 2^(p+1) m| / (n 2^(p+1)) = |(2^(p-1-k) + 1)N - 2^(p+1-k) m| / (n 2^(p+1-k))

The numerator is an integer, and since N is odd, it is in fact an odd integer. Thus, |q̂ - q| ≥ 1/(n 2^(p+1-k)). Assume q < q̂ (the case q > q̂ is similar).¹⁰ Then n q̄ < m, and

    |m - n q̄| = m - n q̄ = n(q - q̄) = n(q - (q̂ - 2^(-p-1)))
              ≤ n(2^(-p-1) - 1/(n 2^(p+1-k))) = (2^(p-1) + 2^k) 2^(-p-1) - 2^(-p-1+k) = 1/4

This establishes (9) and proves the theorem.¹¹ ∎

10. Notice that in binary, q cannot equal q̂. -- Ed.

11. Left as an exercise to the reader: extend the proof to bases other than 2. -- Ed.



The theorem holds true for any base β, as long as 2^i + 2^j is replaced by β^i + β^j. As β gets larger, however, denominators of the form β^i + β^j are farther and farther apart.
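Theorem 7 can be spot-checked directly in IEEE double precision (β = 2, p = 53), where ordinary Python floats are exactly rounded. An added experiment:

    # n of the form 2^i + 2^j: dividing and multiplying back restores m exactly.
    for n in (5, 6, 10, 18, 34):               # 4+1, 4+2, 8+2, 16+2, 32+2
        print(n, all((m / n) * n == m for m in range(1, 10001)))   # True for each

    print(49, (1 / 49) * 49 == 1)              # False: 49 is not of the form 2^i + 2^j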

We are now in a position to answer the question, does it matter if the basic arithmetic operations introduce a little more rounding error than necessary? The answer is that it does matter, because accurate basic operations enable us to prove that formulas are "correct" in the sense that they have a small relative error. "Cancellation" on page 8 discussed several algorithms that require guard digits to produce correct results in this sense. If the input to those formulas are numbers representing imprecise measurements, however, the bounds of Theorems 3 and 4 become less interesting. The reason is that the benign cancellation x - y can become catastrophic if x and y are only approximations to some measured quantity. But accurate operations are useful even in the face of inexact data, because they enable us to establish exact relationships like those discussed in Theorems 6 and 7. These are useful even if every floating-point variable is only an approximation to some actual value.

    The IEEE Standard

There are two different IEEE standards for floating-point computation. IEEE 754 is a binary standard that requires β = 2, p = 24 for single precision and p = 53 for double precision [IEEE 1987]. It also specifies the precise layout of bits in a single and double precision. IEEE 854 allows either β = 2 or β = 10 and unlike 754, does not specify how floating-point numbers are encoded into bits [Cody et al. 1984]. It does not require a particular value for p, but instead it specifies constraints on the allowable values of p for single and double precision. The term IEEE Standard will be used when discussing properties common to both standards.

This section provides a tour of the IEEE standard. Each subsection discusses one aspect of the standard and why it was included. It is not the purpose of this paper to argue that the IEEE standard is the best possible floating-point standard but rather to accept the standard as given and provide an introduction to its use. For full details consult the standards themselves [IEEE 1987; Cody et al. 1984].


    Formats and Operations

    Base

It is clear why IEEE 854 allows β = 10. Base ten is how humans exchange and think about numbers. Using β = 10 is especially appropriate for calculators, where the result of each operation is displayed by the calculator in decimal.

There are several reasons why IEEE 854 requires that if the base is not 10, it must be 2. "Relative Error and Ulps" on page 5 mentioned one reason: the results of error analyses are much tighter when β is 2 because a rounding error of .5 ulp wobbles by a factor of β when computed as a relative error, and error analyses are almost always simpler when based on relative error. A related reason has to do with the effective precision for large bases. Consider β = 16, p = 1 compared to β = 2, p = 4. Both systems have 4 bits of significand. Consider the computation of 15/8. When β = 2, 15 is represented as 1.111 × 2³, and 15/8 as 1.111 × 2⁰. So 15/8 is exact. However, when β = 16, 15 is represented as F × 16⁰, where F is the hexadecimal digit for 15. But 15/8 is represented as 1 × 16⁰, which has only one bit correct. In general, base 16 can lose up to 3 bits, so that a precision of p hexadecimal digits can have an effective precision as low as 4p - 3 rather than 4p binary bits. Since large values of β have these problems, why did IBM choose β = 16 for its system/370? Only IBM knows for sure, but there are two possible reasons. The first is increased exponent range. Single precision on the system/370 has β = 16, p = 6. Hence the significand requires 24 bits. Since this must fit into 32 bits, this leaves 7 bits for the exponent and one for the sign bit. Thus the magnitude of representable numbers ranges from about 16^(-2⁶) to about 16^(2⁶) = 2^(2⁸). To get a similar exponent range when β = 2 would require 9 bits of exponent, leaving only 22 bits for the significand. However, it was just pointed out that when β = 16, the effective precision can be as low as 4p - 3 = 21 bits. Even worse, when β = 2 it is possible to gain an extra bit of precision (as explained later in this section), so the β = 2 machine has 23 bits of precision to compare with a range of 21 to 24 bits for the β = 16 machine.

Another possible explanation for choosing β = 16 has to do with shifting. When adding two floating-point numbers, if their exponents are different, one of the significands will have to be shifted to make the radix points line up, slowing down the operation. In the β = 16, p = 1 system, all the numbers between 1 and 15 have the same exponent, and so no shifting is required when adding any of the (15 choose 2) = 105 possible pairs of distinct numbers from this set. However, in the β = 2, p = 4 system, these numbers have exponents ranging from 0 to 3, and shifting is required for 70 of the 105 pairs.



In most modern hardware, the performance gained by avoiding a shift for a subset of operands is negligible, and so the small wobble of β = 2 makes it the preferable base. Another advantage of using β = 2 is that there is a way to gain an extra bit of significance.¹² Since floating-point numbers are always normalized, the most significant bit of the significand is always 1, and there is no reason to waste a bit of storage representing it. Formats that use this trick are said to have a hidden bit. It was already pointed out in "Floating-point Formats" on page 3 that this requires a special convention for 0. The method given there was that an exponent of e_min - 1 and a significand of all zeros represents not 1.0 × 2^(e_min - 1), but rather 0.

IEEE 754 single precision is encoded in 32 bits using 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand. However, it uses a hidden bit, so the significand is 24 bits (p = 24), even though it is encoded using only 23 bits.
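The layout is easy to inspect from software. The sketch below (added; it relies only on Python's struct module and is valid for normalized numbers) unpacks the three fields of a single precision number and reattaches the hidden bit:

    import struct

    def decode_single(x):
        bits, = struct.unpack(">I", struct.pack(">f", x))
        sign = bits >> 31                      # 1 sign bit
        k = (bits >> 23) & 0xFF                # 8 biased exponent bits
        frac = bits & 0x7FFFFF                 # 23 stored significand bits
        return sign, k - 127, 1.0 + frac / 2.0 ** 23   # hidden bit gives p = 24

    print(decode_single(1.0))    # (0, 0, 1.0)
    print(decode_single(0.1))    # (0, -4, 1.600000023841858): 0.1 is inexact in binary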

    Precision

The IEEE standard defines four different precisions: single, double, single-extended, and double-extended. In 754, single and double precision correspond roughly to what most floating-point hardware provides. Single precision occupies a single 32 bit word, double precision two consecutive 32 bit words. Extended precision is a format that offers at least a little extra precision and exponent range (Table 1).

12. This appears to have first been published by Goldberg [1967], although Knuth ([1981], page 211) attributes this idea to Konrad Zuse.



The IEEE standard only specifies a lower bound on how many extra bits extended precision provides. The minimum allowable double-extended format is sometimes referred to as 80-bit format, even though the table shows it using 79 bits. The reason is that hardware implementations of extended precision normally don't use a hidden bit, and so would use 80 rather than 79 bits.¹³

The standard puts the most emphasis on extended precision, making no recommendation concerning double precision, but strongly recommending that "Implementations should support the extended format corresponding to the widest basic format supported, ..."

One motivation for extended precision comes from calculators, which will often display 10 digits, but use 13 digits internally. By displaying only 10 of the 13 digits, the calculator appears to the user as a black box that computes exponentials, cosines, etc. to 10 digits of accuracy. For the calculator to compute functions like exp, log and cos to within 10 digits with reasonable efficiency, it needs a few extra digits to work with. It isn't hard to find a simple rational expression that approximates log with an error of 500 units in the last place. Thus computing with 13 digits gives an answer correct to 10 digits. By keeping these extra 3 digits hidden, the calculator presents a simple model to the operator.

13. According to Kahan, extended precision has 64 bits of significand because that was the widest precision across which carry propagation could be done on the Intel 8087 without increasing the cycle time [Kahan 1988].

Table 1    IEEE 754 Format Parameters

    Parameter                 Single    Single-Extended    Double    Double-Extended
    p                         24        ≥ 32               53        ≥ 64
    e_max                     +127      ≥ +1023            +1023     > +16383
    e_min                     -126      ≤ -1022            -1022     ≤ -16382
    Exponent width in bits    8         ≥ 11               11        ≥ 15
    Format width in bits      32        ≥ 43               64        ≥ 79



Extended precision in the IEEE standard serves a similar function. It enables libraries to efficiently compute quantities to within about .5 ulp in single (or double) precision, giving the user of those libraries a simple model, namely that each primitive operation, be it a simple multiply or an invocation of log, returns a value accurate to within about .5 ulp. However, when using extended precision, it is important to make sure that its use is transparent to the user. For example, on a calculator, if the internal representation of a displayed value is not rounded to the same precision as the display, then the result of further operations will depend on the hidden digits and appear unpredictable to the user.

To illustrate extended precision further, consider the problem of converting between IEEE 754 single precision and decimal. Ideally, single precision numbers will be printed with enough digits so that when the decimal number is read back in, the single precision number can be recovered. It turns out that 9 decimal digits are enough to recover a single precision binary number (see "Binary to Decimal Conversion" on page 59). When converting a decimal number back to its unique binary representation, a rounding error as small as 1 ulp is fatal, because it will give the wrong answer. Here is a situation where extended precision is vital for an efficient algorithm. When single-extended is available, a very straightforward method exists for converting a decimal number to a single precision binary one. First read in the 9 decimal digits as an integer N, ignoring the decimal point. From Table 1, p ≥ 32, and since 10⁹ < 2³² ≈ 4.3 × 10⁹, N can be represented exactly in single-extended. Next find the appropriate power 10^P necessary to scale N. This will be a combination of the exponent of the decimal number, together with the position of the (up until now) ignored decimal point. Compute 10^|P|. If |P| ≤ 13, then this is also represented exactly, because 10¹³ = 2¹³5¹³, and 5¹³ < 2³². Finally multiply (or divide if P < 0) N and 10^|P|. If this last operation is done exactly, then the closest binary number is recovered. "Binary to Decimal Conversion" on page 59 shows how to do the last multiply (or divide) exactly. Thus for |P| ≤ 13, the use of the single-extended format enables 9 digit decimal numbers to be converted to the closest binary number (i.e. exactly rounded). If |P| > 13, then single-extended is not enough for the above algorithm to always compute the exactly rounded binary equivalent, but Coonen [1984] shows that it is enough to guarantee that the conversion of binary to decimal and back will recover the original binary number.
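The 9-digit claim can be tested empirically. The probe below is an added experiment (%.9g formats a value to 9 significant decimal digits); it round-trips single precision values through decimal strings:

    import struct

    def to_single(x):
        # Round a Python float (a double) to IEEE 754 single precision.
        return struct.unpack(">f", struct.pack(">f", x))[0]

    x = to_single(0.1)
    print(to_single(float("%.9g" % x)) == x)      # True: 9 digits round-trip

    # 8 digits are not always enough: between 1000 and 1024, single precision
    # numbers are 2^-14 ~ 6.1e-5 apart, but 8-digit decimals are 1e-4 apart,
    # so by pigeonhole some neighbors collapse to the same decimal string.
    vals = [to_single(1000.0 + k * 2.0 ** -14) for k in range(200)]
    print(all(to_single(float("%.8g" % v)) == v for v in vals))   # False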


If double precision is supported, then the algorithm above would be run in double precision rather than single-extended, but to convert double precision to a 17 digit decimal number and back would require the double-extended format.

    Exponent

    Since the expon ent can be positive or negative, some m ethod mu st be chosen torepresent its sign. Two common methods of representing signed numbers aresign/ magnitude and twos complement. Sign/ magnitud e is the system usedfor the sign of the significand in the IEEE formats: one bit is used to hold thesign, the rest of the bits represent the magnitude of the number. The twoscomplemen t representation is often u sed in integer arithm etic. In th is schem e, anumber in the range [-2p-1, 2p-1 - 1] is represented by the sm allest nonn egativenum ber that is congruent to it modu lo 2p .

    The IEEE binary standard doesn't use either of these methods to represent the exponent, but instead uses a biased representation. In the case of single precision, where the exponent is stored in 8 bits, the bias is 127 (for double precision it is 1023). What this means is that if k is the value of the exponent bits interpreted as an unsigned integer, then the exponent of the floating-point number is k - 127. This is often called the unbiased exponent to distinguish from the biased exponent k.

    Referring to Table 1 on page 20, single precision has emax = 127 and emin = -126. The reason for having |emin| < emax is so that the reciprocal of the smallest number (1/2^emin) will not overflow. Although it is true that the reciprocal of the largest number will underflow, underflow is usually less serious than overflow. Base on page 18 explained that emin - 1 is used for representing 0, and Special Quantities on page 24 will introduce a use for emax + 1. In IEEE single precision, this means that the biased exponents range between emin - 1 = -127 and emax + 1 = 128, whereas the unbiased exponents range between 0 and 255, which are exactly the nonnegative numbers that can be represented using 8 bits.
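
    The biased encoding is easy to see by picking a float apart. The following sketch (not from the text) decodes the 8-bit exponent field of an IEEE single and removes the bias of 127:

        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>

        int main(void)
        {
            float f = 6.0f;                    /* 6.0 = 1.5 x 2^2 */
            uint32_t bits;
            memcpy(&bits, &f, sizeof bits);    /* reinterpret the bit pattern */

            unsigned k = (bits >> 23) & 0xFF;  /* biased exponent field */
            printf("k = %u, unbiased exponent = %d\n", k, (int)k - 127);
            /* prints: k = 129, unbiased exponent = 2 */
            return 0;
        }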

    Operations

    The IEEE standard requires that the result of addition, subtraction, multiplication and division be exactly rounded. That is, the result must be computed exactly and then rounded to the nearest floating-point number (using round to even). Guard Digits on page 6 pointed out that computing the exact difference or sum of two floating-point numbers can be very expensive when their exponents are substantially different. That section introduced guard digits, which provide a practical way of computing differences while guaranteeing that the relative error is small. However, computing with a single guard digit will not always give the same answer as computing the exact result and then rounding. By introducing a second guard digit and a third sticky bit, differences can be computed at only a little more cost than with a single guard digit, but the result is the same as if the difference were computed exactly and then rounded [Goldberg 1990]. Thus the standard can be implemented efficiently.

    One reason for completely specifying the results of arithmetic operations is to improve the portability of software. When a program is moved between two machines and both support IEEE arithmetic, then if any intermediate result differs, it must be because of software bugs, not from differences in arithmetic. Another advantage of precise specification is that it makes it easier to reason about floating-point. Proofs about floating-point are hard enough, without having to deal with multiple cases arising from multiple kinds of arithmetic. Just as integer programs can be proven to be correct, so can floating-point programs, although what is proven in that case is that the rounding error of the result satisfies certain bounds. Theorem 4 is an example of such a proof. These proofs are made much easier when the operations being reasoned about are precisely specified. Once an algorithm is proven to be correct for IEEE arithmetic, it will work correctly on any machine supporting the IEEE standard.
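
    A small instance of the kind of provable fact exact rounding makes available is Dekker's Fast2Sum: when |a| ≥ |b| and addition and subtraction are exactly rounded, the rounding error of a + b is itself a floating-point number and can be recovered exactly. This sketch is not from the text; it simply illustrates the style of guarantee being discussed:

        #include <stdio.h>

        /* Fast2Sum: requires |a| >= |b| and exactly rounded +/-.
         * Then s + err == a + b exactly, on any IEEE machine. */
        void fast2sum(double a, double b, double *s, double *err)
        {
            *s = a + b;            /* rounded sum */
            *err = b - (*s - a);   /* the rounding error, recovered exactly */
        }

        int main(void)
        {
            double s, err;
            fast2sum(1.0, 1e-20, &s, &err);
            printf("s = %g, err = %g\n", s, err);  /* s = 1, err = 1e-20 */
            return 0;
        }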

    Brown [1981] has proposed axioms for floating-point that include most of the existing floating-point hardware. However, proofs in this system cannot verify the algorithms of sections Cancellation on page 8 and Exactly Rounded Operations on page 13, which require features not present on all hardware. Furthermore, Brown's axioms are more complex than simply defining operations to be performed exactly and then rounded. Thus proving theorems from Brown's axioms is usually more difficult than proving them assuming operations are exactly rounded.

    There is not complete agreement on what operations a floating-point standard should cover. In addition to the basic operations +, -, × and /, the IEEE standard also specifies that square root, remainder, and conversion between integer and floating-point be correctly rounded. It also requires that conversion between internal formats and decimal be correctly rounded (except for very large numbers). Kulisch and Miranker [1986] have proposed adding inner product to the list of operations that are precisely specified. They note that when inner products are computed in IEEE arithmetic, the final answer can be quite wrong. For example sums are a special case of inner products, and the sum ((2 × 10^-30 + 10^30) - 10^30) - 10^-30 is exactly equal to 10^-30, but on a machine with IEEE arithmetic the computed result will be -10^-30. It is possible to compute inner products to within 1 ulp with less hardware than it takes to implement a fast multiplier [Kirchner and Kulisch 1987].14 15
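
    The quoted sum is easy to reproduce in IEEE double precision, where 2 × 10^-30 is far below half an ulp of 10^30 and disappears in the first addition:

        #include <stdio.h>

        int main(void)
        {
            /* Each operation below is exactly rounded, yet the final
             * result is -1e-30 instead of the true value 1e-30. */
            double s = ((2e-30 + 1e30) - 1e30) - 1e-30;
            printf("%g\n", s);   /* prints -1e-30 */
            return 0;
        }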


    All the operations mentioned in the standard are required to be exactly rounded except conversion between decimal and binary. The reason is that efficient algorithms for exactly rounding all the operations are known, except conversion. For conversion, the best known efficient algorithms produce results that are slightly worse than exactly rounded ones [Coonen 1984].

    The IEEE standard does not require transcendental functions to be exactly rounded because of the table maker's dilemma. To illustrate, suppose you are making a table of the exponential function to 4 places. Then exp(1.626) = 5.0835. Should this be rounded to 5.083 or 5.084? If exp(1.626) is computed more carefully, it becomes 5.08350. And then 5.083500. And then 5.0835000. Since exp is transcendental, this could go on arbitrarily long before distinguishing whether exp(1.626) is 5.083500...0ddd or 5.0834999...9ddd. Thus it is not practical to specify that the precision of transcendental functions be the same as if they were computed to infinite precision and then rounded. Another approach would be to specify transcendental functions algorithmically. But there does not appear to be a single algorithm that works well across all hardware architectures. Rational approximation, CORDIC,16 and large tables are three different techniques that are used for computing transcendentals on contemporary machines. Each is appropriate for a different class of hardware, and at present no single algorithm works acceptably over the wide range of current hardware.

    Special Quantities

    On some floating-point hardware every bit pattern represents a valid floating-point number. The IBM System/370 is an example of this. On the other hand, the VAX reserves some bit patterns to represent special numbers called reserved operands. This idea goes back to the CDC 6600, which had bit patterns for the special quantities INDEFINITE and INFINITY.

    The IEEE standard continues in this tradition and has NaNs and infinities (NaN stands for Not a Number). Without any special quantities, there is no good way to handle exceptional situations like taking the square root of a negative number, other than aborting computation. Under IBM System/370 FORTRAN,

    14. Some arguments against including inner product as one of the basic operations are presented by Kahan and LeBlanc [1985].

    15. Kirchner writes: It is possible to compute inner products to within 1 ulp in hardware in one partial product per clock cycle. The additionally needed hardware compares to the multiplier array needed anyway for that speed.

    16. CORDIC is an acronym for Coordinate Rotation Digital Computer and is a method of computing transcendental functions that uses mostly shifts and adds (i.e., very few multiplications and divisions) [Walther 1971]. It is the method used on both the Intel 8087 and the Motorola 68881.


    the default action in response to computing the square root of a negative number like -4 results in the printing of an error message. Since every bit pattern represents a valid number, the return value of square root must be some floating-point number. In the case of System/370 FORTRAN, √-4 = 2 is returned. In IEEE arithmetic, a NaN is returned in this situation. The IEEE standard specifies the following special values (see Table 2): ±0, denormalized numbers, ±∞ and NaNs (there is more than one NaN, as explained in the next section). These special values are all encoded with exponents of either emax + 1 or emin - 1 (it was already pointed out that 0 has an exponent of emin - 1).

    NaNs

    Traditionally, the computation of 0/0 or √-1 has been treated as an unrecoverable error which causes a computation to halt. However, there are examples where it makes sense for a computation to continue in such a situation. Consider a subroutine that finds the zeros of a function f, say zero(f). Traditionally, zero finders require the user to input an interval [a, b] on which the function is defined and over which the zero finder will search. That is, the subroutine is called as zero(f, a, b). A more useful zero finder would not require the user to input this extra information. This more general zero finder is especially appropriate for calculators, where it is natural to simply key in a function, and awkward to then have to specify the domain. However, it is easy to see why most zero finders require a domain. The zero finder does its work by probing the function f at various values. If it probed for a value outside the domain of f, the code for f might well compute 0/0 or √-1, and the computation would halt, unnecessarily aborting the zero finding process.

    Table 2  IEEE 754 Special Values

    Exponent           Fraction   Represents
    e = emin - 1       f = 0      ±0
    e = emin - 1       f ≠ 0      0.f × 2^emin
    emin ≤ e ≤ emax    (any f)    1.f × 2^e
    e = emax + 1       f = 0      ±∞
    e = emax + 1       f ≠ 0      NaN


    This problem can be avoided by introducing a special value called NaN, and specifying that the computation of expressions like 0/0 and √-1 produce NaN, rather than halting. A list of some of the situations that can cause a NaN are given in Table 3. Then when zero(f) probes outside the domain of f, the code for f will return NaN, and the zero finder can continue. That is, zero(f) isn't punished for making an incorrect guess. With this example in mind, it is easy to see what the result of combining a NaN with an ordinary floating-point number should be. Suppose that the final statement of f is return(-b + sqrt(d))/(2*a). If d < 0, then f should return a NaN. Since d < 0, sqrt(d) is a NaN, and -b + sqrt(d) will be a NaN, if the sum of a NaN and any other number is a NaN. Similarly if one operand of a division operation is a NaN, the quotient should be a NaN. In general, whenever a NaN participates in a floating-point operation, the result is another NaN.
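
    The quadratic-formula fragment behaves exactly this way in C, where sqrt of a negative number returns a NaN that then propagates through + and /:

        #include <stdio.h>
        #include <math.h>

        /* f returns NaN instead of halting when d < 0. */
        double f(double a, double b, double d)
        {
            return (-b + sqrt(d)) / (2 * a);
        }

        int main(void)
        {
            double r = f(1.0, 2.0, -4.0);
            printf("r = %g, isnan(r) = %d\n", r, isnan(r));
            return 0;
        }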

    Another approach to writing a zero solver that doesn't require the user to input a domain is to use signals. The zero-finder could install a signal handler for floating-point exceptions. Then if f was evaluated outside its domain and raised an exception, control would be returned to the zero solver. The problem with this approach is that every language has a different method of handling signals (if it has a method at all), and so it has no hope of portability.

    In IEEE 754, NaNs are often represented as floating-point numbers with the exponent emax + 1 and nonzero significands. Implementations are free to put system-dependent information into the significand. Thus there is not a unique NaN, but rather a whole family of NaNs. When a NaN and an ordinary floating-point number are combined, the result should be the same as the NaN operand. Thus if the result of a long computation is a NaN, the system-dependent information in the significand will be the information that was generated when the first NaN in the computation was generated. Actually, there is a caveat to the last statement. If both operands are NaNs, then the result will be one of those NaNs, but it might not be the NaN that was generated first.
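
    The family of NaNs is visible by dumping bit patterns. This sketch compares the raw bits of two NaNs produced differently; the significand payload that appears is system-dependent, as the text says:

        #include <stdio.h>
        #include <stdint.h>
        #include <string.h>
        #include <math.h>

        static uint64_t bits_of(double d)
        {
            uint64_t u;
            memcpy(&u, &d, sizeof u);
            return u;
        }

        int main(void)
        {
            volatile double zero = 0.0;
            double a = zero / zero;     /* NaN from 0/0 */
            double b = nan("123");      /* NaN with a requested payload (C99) */
            printf("%016llx\n%016llx\n",
                   (unsigned long long)bits_of(a),
                   (unsigned long long)bits_of(b));
            return 0;
        }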

    Table 3  Operations That Produce a NaN

    Operation   NaN Produced By
    +           ∞ + (-∞)
    ×           0 × ∞
    /           0/0, ∞/∞
    REM         x REM 0, ∞ REM y
    √           √x (when x < 0)


    Infinity

    Just as NaNs provide a way to continue a computation when expressions like 0/0 or √-1 are encountered, infinities provide a way to continue when an overflow occurs. This is much safer than simply returning the largest representable number. As an example, consider computing √(x^2 + y^2) when β = 10, p = 3, and emax = 98. If x = 3 × 10^70 and y = 4 × 10^70, then x^2 will overflow, and be replaced by 9.99 × 10^98. Similarly y^2, and x^2 + y^2 will each overflow in turn, and be replaced by 9.99 × 10^98. So the final result will be √(9.99 × 10^98) = 3.16 × 10^49, which is drastically wrong: the correct answer is 5 × 10^70. In IEEE arithmetic, the result of x^2 is ∞, as is y^2, x^2 + y^2 and √(x^2 + y^2). So the final result is ∞, which is safer than returning an ordinary floating-point number that is nowhere near the correct answer.17

    The division of 0 by 0 results in a NaN. A nonzero number divided by 0, however, returns infinity: 1/0 = ∞, -1/0 = -∞. The reason for the distinction is this: if f(x) → 0 and g(x) → 0 as x approaches some limit, then f(x)/g(x) could have any value. For example, when f(x) = sin x and g(x) = x, then f(x)/g(x) → 1 as x → 0. But when f(x) = 1 - cos x, f(x)/g(x) → 0. When thinking of 0/0 as the limiting situation of a quotient of two very small numbers, 0/0 could represent anything. Thus in the IEEE standard, 0/0 results in a NaN. But when c > 0, f(x) → c, and g(x) → 0, then f(x)/g(x) → ±∞, for any analytic functions f and g. If g(x) < 0 for small x, then f(x)/g(x) → -∞, otherwise the limit is +∞. So the IEEE standard defines c/0 = ±∞, as long as c ≠ 0. The sign of ∞ depends on the signs of c and 0 in the usual way, so that -10/0 = -∞, and -10/-0 = +∞. You can distinguish between getting ∞ because of overflow and getting ∞ because of division by zero by checking the status flags (which will be discussed in detail in section Flags on page 36). The overflow flag will be set in the first case, the division by zero flag in the second.
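
    In C99 the status flags are reachable through <fenv.h>, so the distinction just described can be tested directly (the volatile qualifiers merely keep the compiler from folding the operations away):

        #include <stdio.h>
        #include <fenv.h>

        #pragma STDC FENV_ACCESS ON

        int main(void)
        {
            volatile double big = 1e308, zero = 0.0, r;

            feclearexcept(FE_ALL_EXCEPT);
            r = big * big;                       /* +inf via overflow */
            printf("overflow set: %d\n", fetestexcept(FE_OVERFLOW) != 0);

            feclearexcept(FE_ALL_EXCEPT);
            r = 1.0 / zero;                      /* a genuine infinity */
            printf("divbyzero set: %d\n", fetestexcept(FE_DIVBYZERO) != 0);

            (void)r;
            return 0;
        }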

    The rule for determining the result of an operation that has infinity as an operand is simple: replace infinity with a finite number x and take the limit as x → ∞. Thus 3/∞ = 0, because lim(x→∞) 3/x = 0. Similarly, 4 - ∞ = -∞, and √∞ = ∞. When the limit doesn't exist, the result is a NaN, so ∞/∞ will be a NaN (Table 3 on page 26 has additional examples). This agrees with the reasoning used to conclude that 0/0 should be a NaN.

    17. Fine point: Although the default in IEEE arithmetic is to round overflowed numbers to ∞, it is possible to change the default (see Rounding Modes on page 35).


    When a subexpression evaluates to a NaN, the value of the entire expression is also a NaN. In the case of ±∞, however, the value of the expression might be an ordinary floating-point number because of rules like 1/∞ = 0. Here is a practical example that makes use of the rules for infinity arithmetic. Consider computing the function x/(x^2 + 1). This is a bad formula, because not only will it overflow when x is larger than √β × β^(emax/2), but infinity arithmetic will give the wrong answer because it will yield 0, rather than a number near 1/x. However, x/(x^2 + 1) can be rewritten as 1/(x + x^-1). This improved expression will not overflow prematurely and because of infinity arithmetic will have the correct value when x = 0: 1/(0 + 0^-1) = 1/(0 + ∞) = 1/∞ = 0. Without infinity arithmetic, the expression 1/(x + x^-1) requires a test for x = 0, which not only adds extra instructions, but may also disrupt a pipeline. This example illustrates a general fact, namely that infinity arithmetic often avoids the need for special case checking; however, formulas need to be carefully inspected to make sure they do not have spurious behavior at infinity (as x/(x^2 + 1) did).
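
    A quick check in IEEE double shows both effects: for huge x the naive formula collapses to 0 because x^2 overflows to ∞, while the rewrite stays accurate and also survives x = 0:

        #include <stdio.h>

        int main(void)
        {
            double xs[] = { 1e200, 0.0 };
            for (int i = 0; i < 2; i++) {
                double x = xs[i];
                double bad  = x / (x * x + 1);   /* x*x overflows to inf */
                double good = 1 / (x + 1 / x);   /* 1/0 = inf is harmless */
                printf("x = %g: bad = %g, good = %g\n", x, bad, good);
            }
            return 0;
        }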

    Signed Zero

    Zero is represented by the exponent emin - 1 and a zero significand. Since the sign bit can take on two different values, there are two zeros, +0 and -0. If a distinction were made when comparing +0 and -0, simple tests like if (x = 0) would have very unpredictable behavior, depending on the sign of x. Thus the IEEE standard defines comparison so that +0 = -0, rather than -0 < +0.

    Although it would be possible always to ignore the sign of zero, the IEEE standard does not do so. When a multiplication or division involves a signed zero, the usual sign rules apply in computing the sign of the answer. Thus 3 × (+0) = +0, and +0/-3 = -0. If zero did not have a sign, then the relation 1/(1/x) = x would fail to hold when x = ±∞. The reason is that 1/-∞ and 1/+∞ both result in 0, and 1/0 results in +∞, the sign information having been lost. One way to restore the identity 1/(1/x) = x is to only have one kind of infinity, however that would result in the disastrous consequence of losing the sign of an overflowed quantity.
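
    These rules are easy to observe in C: the two zeros compare equal, yet divide into 1 differently, and copysign (C99) exposes the sign bit that == ignores:

        #include <stdio.h>
        #include <math.h>

        int main(void)
        {
            double pz = +0.0, nz = -0.0;
            printf("pz == nz: %d\n", pz == nz);                   /* 1 */
            printf("1/pz = %g, 1/nz = %g\n", 1/pz, 1/nz);         /* inf, -inf */
            printf("copysign(1, nz) = %g\n", copysign(1.0, nz));  /* -1 */
            return 0;
        }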

    Another example of the use of signed zero concerns underflow and functions that have a discontinuity at 0, such as log. In IEEE arithmetic, it is natural to define log 0 = -∞ and log x to be a NaN when x < 0. Suppose that x represents a small negative number that has underflowed to zero. Thanks to signed zero, x will be negative, so log can return a NaN. However, if there were no signed zero, the log function could not distinguish an underflowed negative number from 0, and would therefore have to return -∞. Another example of a function with a discontinuity at zero is the signum function, which returns the sign of a number.


    Probably the most interesting use of signed zero occurs in complex arithmetic. To take a simple example, consider the equation √(1/z) = 1/√z. This is certainly true when z ≥ 0. If z = -1, the obvious computation gives √(1/-1) = √-1 = i and 1/√-1 = 1/i = -i. Thus √(1/z) ≠ 1/√z! The problem can be traced to the fact that square root is multi-valued, and there is no way to select the values so that it is continuous in the entire complex plane. However, square root is continuous if a branch cut consisting of all negative real numbers is excluded from consideration. This leaves the problem of what to do for the negative real numbers, which are of the form -x + i0, where x > 0. Signed zero provides a perfect way to resolve this problem. Numbers of the form -x + i(+0) have one square root, i√x, and numbers of the form -x + i(-0) on the other side of the branch cut have the other square root, -i√x. In fact, the natural formulas for computing √ will give these results. Back to √(1/z) = 1/√z. If z = -1 = -1 + i0, then 1/z = 1/(-1 + i0) = [1 × (-1 - i0)]/[(-1 + i0)(-1 - i0)] = (-1 - i0)/((-1)^2 - 0^2) = -1 + i(-0), and so √(1/z) = √(-1 + i(-0)) = -i, while 1/√z = 1/i = -i. Thus IEEE arithmetic preserves this identity for all z. Some more sophisticated examples are given by Kahan [1987]. Although distinguishing between +0 and -0 has advantages, it can occasionally be confusing. For example, signed zero destroys the relation x = y ⇔ 1/x = 1/y, which is false when x = +0 and y = -0. However, the IEEE committee decided that the advantages of utilizing the sign of zero outweighed the disadvantages.
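
    C99's Annex G complex arithmetic implements exactly this branch-cut convention, as a quick test with csqrt shows (CMPLX is C11; on older compilers the -0.0 imaginary part must be built another way):

        #include <stdio.h>
        #include <complex.h>

        int main(void)
        {
            double complex a = CMPLX(-1.0, +0.0);   /* just above the cut */
            double complex b = CMPLX(-1.0, -0.0);   /* just below the cut */
            printf("csqrt(-1+0i) = %g%+gi\n", creal(csqrt(a)), cimag(csqrt(a)));
            printf("csqrt(-1-0i) = %g%+gi\n", creal(csqrt(b)), cimag(csqrt(b)));
            /* prints 0+1i and 0-1i: the two square roots of -1 */
            return 0;
        }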

    Denormalized Numbers

    Consider normalized floating-point numbers with β = 10, p = 3, and emin = -98. The numbers x = 6.87 × 10^-97 and y = 6.81 × 10^-97 appear to be perfectly ordinary floating-point numbers, which are more than a factor of 10 larger than the smallest floating-point number 1.00 × 10^-98. They have a strange property, however: x ⊖ y = 0 even though x ≠ y! The reason is that x - y = .06 × 10^-97 = 6.0 × 10^-99 is too small to be represented as a normalized number, and so must be flushed to zero. How important is it to preserve the property

    x = y ⇔ x - y = 0 ?  (10)



    It's very easy to imagine writing the code fragment, if (x ≠ y) then z = 1/(x-y), and much later having a program fail due to a spurious division by zero. Tracking down bugs like this is frustrating and time consuming. On a more philosophical level, computer science textbooks often point out that even though it is currently impractical to prove large programs correct, designing programs with the idea of proving them often results in better code. For example, introducing invariants is quite useful, even if they aren't going to be used as part of a proof. Floating-point code is just like any other code: it helps to have provable facts on which to depend. For example, when analyzing formula (6), it was very helpful to know that x/2 < y < 2x ⇒ x ⊖ y = x - y. Similarly, knowing that (10) is true makes writing reliable floating-point code easier. If it is only true for most numbers, it cannot be used to prove anything.

    The IEEE standard uses denormalized18 numbers, which guarantee (10), as well as other useful relations. They are the most controversial part of the standard and probably accounted for the long delay in getting 754 approved. Most high performance hardware that claims to be IEEE compatible does not support denormalized numbers directly, but rather traps when consuming or producing denormals, and leaves it to software to simulate the IEEE standard.19 The idea behind denormalized numbers goes back to Goldberg [1967] and is very simple. When the exponent is emin, the significand does not have to be normalized, so that when β = 10, p = 3 and emin = -98, 1.00 × 10^-98 is no longer the smallest floating-point number, because 0.98 × 10^-98 is also a floating-point number.

    There is a small snag when β = 2 and a hidden bit is being used, since a number with an exponent of emin will always have a significand greater than or equal to 1.0 because of the implicit leading bit. The solution is similar to that used to represent 0, and is summarized in Table 2 on page 25. The exponent emin is used to represent denormals. More formally, if the bits in the significand field are b1, b2, ..., b(p-1), and the value of the exponent is e, then when e > emin - 1, the number being represented is 1.b1b2...b(p-1) × 2^e whereas when e = emin - 1, the number being represented is 0.b1b2...b(p-1) × 2^(e+1). The +1 in the exponent is needed because denormals have an exponent of emin, not emin - 1.

    Recall the example of β = 10, p = 3, emin = -98, x = 6.87 × 10^-97 and y = 6.81 × 10^-97 presented at the beginning of this section. With denormals, x - y doesn't flush to zero but is instead represented by the denormalized number .6 × 10^-98. This behavior is called gradual underflow. It is easy to verify that (10) always holds when using gradual underflow.
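
    The same experiment works in IEEE double, whose subnormals play the role of the denormals above; the difference of two adjacent tiny numbers is subnormal but not zero, so (10) holds:

        #include <stdio.h>
        #include <float.h>
        #include <math.h>

        int main(void)
        {
            double y = DBL_MIN;              /* smallest normalized double */
            double x = nextafter(y, 1.0);    /* the next double above it */
            double d = x - y;                /* one subnormal ulp, not 0 */
            printf("x != y: %d, x - y = %g\n", x != y, d);
            printf("d is subnormal: %d\n", fpclassify(d) == FP_SUBNORMAL);
            return 0;
        }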

    18. They are called subnormal in 854, denormal in 754.

    19. This is the cause of one of the most troublesome aspects of the standard. Programs that frequently underflow often run noticeably slower on hardware that uses software traps.


    Figure 2  Flush To Zero Compared With Gradual Underflow
    [Figure: two number lines marked 0, β^emin, β^(emin+1), β^(emin+2), β^(emin+3); the top line is empty between 0 and β^emin, while the bottom line fills that gap with evenly spaced denormals.]

    Figure 2 illustrates denormalized numbers. The top number line in the figure shows normalized floating-point numbers. Notice the gap between 0 and the smallest normalized number 1.0 × β^emin. If the result of a floating-point calculation falls into this gulf, it is flushed to zero. The bottom number line shows what happens when denormals are added to the set of floating-point numbers. The gulf is filled in, and when the result of a calculation is less than 1.0 × β^emin, it is represented by the nearest denormal. When denormalized numbers are added to the number line, the spacing between adjacent floating-point numbers varies in a regular way: adjacent spacings are either the same length or differ by a factor of β. Without denormals, the spacing abruptly changes from β^(-p+1) × β^emin to β^emin, which is a factor of β^(p-1), rather than the orderly change by a factor of β. Because of this, many algorithms that can have large relative error for normalized numbers close to the underflow threshold are well-behaved in this range when gradual underflow is used.

    Without gradual underflow, the simple expression x - y can have a very large relative error for normalized inputs, as was seen above for x = 6.87 × 10^-97 and y = 6.81 × 10^-97. Large relative errors can happen even without cancellation, as the following example shows [Demmel 1984]. Consider dividing two complex numbers, a + ib and c + id. The obvious formula

    (a + ib)/(c + id) = (ac + bd)/(c^2 + d^2) + i(bc - ad)/(c^2 + d^2)

    suffers from the problem that if either component of the denominator c + id is larger than √β × β^(emax/2), the formula will overflow, even though the final result may be well within range. A better method of computing the quotients is to use Smith's formula:



    (a + ib)/(c + id) = (a + b(d/c))/(c + d(d/c)) + i(b - a(d/c))/(c + d(d/c))    (if |d| ≤ |c|)
                      = (b + a(c/d))/(d + c(c/d)) + i(-a + b(c/d))/(d + c(c/d))  (if |d| > |c|)   (11)

    Applying Smith's formula to (2 × 10^-98 + i10^-98)/(4 × 10^-98 + i(2 × 10^-98)) gives the correct answer of 0.5 with gradual underflow. It yields 0.4 with flush to zero, an error of 100 ulps. It is typical for denormalized numbers to guarantee error bounds for arguments all the way down to 1.0 × β^emin.
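
    Smith's formula is short to transcribe into C; the branch keeps the intermediate quotient at most 1 in magnitude, which is what prevents the premature overflow of the obvious formula:

        #include <stdio.h>
        #include <math.h>

        /* Complex division (a+ib)/(c+id) by Smith's formula (11). */
        void cdiv(double a, double b, double c, double d,
                  double *re, double *im)
        {
            if (fabs(d) <= fabs(c)) {
                double r = d / c, den = c + d * r;
                *re = (a + b * r) / den;
                *im = (b - a * r) / den;
            } else {
                double r = c / d, den = d + c * r;
                *re = (b + a * r) / den;
                *im = (-a + b * r) / den;
            }
        }

        int main(void)
        {
            double re, im;
            cdiv(1, 2, 3, 4, &re, &im);  /* (1+2i)/(3+4i) = 0.44 + 0.08i */
            printf("%g + %gi\n", re, im);
            return 0;
        }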

    Exceptions, Flags and Trap Handlers

    When an exceptional condition like division by zero or overflow occurs in IEEE arithmetic, the default is to deliver a result and continue. Typical of the default results are NaN for 0/0 and √-1, and ∞ for 1/0 and overflow. The preceding sections gave examples where proceeding from an exception with these default values was the reasonable thing to do. When any exception occurs, a status flag is also set. Implementations of the IEEE standard are required to provide users with a way to read and write the status flags. The flags are sticky in that once set, they remain set until explicitly cleared. Testing the flags is the only way to distinguish 1/0, which is a genuine infinity, from an overflow.

    Sometimes continuing execution in the face of exception conditions is not appropriate. Infinity on page 27 gave the example of x/(x^2 + 1). When x > √β × β^(emax/2), the denominator is infinite, resulting in a final answer of 0, which is totally wrong. Although for this formula the problem can be solved by rewriting it as 1/(x + x^-1), rewriting may not always solve the problem. The IEEE standard strongly recommends that implementations allow trap handlers to be installed. Then when an exception occurs, the trap handler is called instead of setting the flag. The value returned by the trap handler will be used as the result of the operation. It is the responsibility of the trap handler to either clear or set the status flag; otherwise, the value of the flag is allowed to be undefined.

    The IEEE standard divides exceptions into 5 classes: overflow, underflow, division by zero, invalid operation and inexact. There is a separate status flag for each class of exception. The meaning of the first three exceptions is self-evident. Invalid operation covers the situations listed in Table 3 on page 26, and any comparison that involves a NaN. The default result of an operation that causes an invalid exception is to return a NaN, but the converse is not true. When one of the operands to an operation is a NaN, the result is a NaN but no invalid exception is raised unless the operation also satisfies one of the conditions in Table 3 on page 26.20


    The inexact exception is raised when the result of a floating-point operation is not exact. In the β = 10, p = 3 system, 3.5 ⊗ 4.2 = 14.7 is exact, but 3.5 ⊗ 4.3 = 15.0 is not exact (since 3.5 × 4.3 = 15.05), and raises an inexact exception. Binary to Decimal Conversion on page 59 discusses an algorithm that uses the inexact exception. A summary of the behavior of all five exceptions is given in Table 4.
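
    The same flag is visible from C99's <fenv.h>; an exact operation leaves FE_INEXACT clear, while one that must round sets it:

        #include <stdio.h>
        #include <fenv.h>

        #pragma STDC FENV_ACCESS ON

        int main(void)
        {
            volatile double half = 0.5, quarter = 0.25, one = 1.0, three = 3.0;
            volatile double r;

            feclearexcept(FE_ALL_EXCEPT);
            r = half + quarter;                 /* 0.75: exact in binary */
            printf("inexact: %d\n", fetestexcept(FE_INEXACT) != 0);

            feclearexcept(FE_ALL_EXCEPT);
            r = one / three;                    /* must round */
            printf("inexact: %d\n", fetestexcept(FE_INEXACT) != 0);

            (void)r;
            return 0;
        }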

    There is an implementation issue connected with the fact that the inexact exception is raised so often. If floating-point hardware doesn't have flags of its own, but instead interrupts the operating system to signal a floating-point exception, the cost of inexact exceptions could be prohibitive. This cost can be avoided by having the status flags maintained by software. The first time an exception is raised, set the software flag for the appropriate class, and tell the floating-point hardware to mask off that class of exceptions. Then all further exceptions will run without interrupting the operating system. When a user resets that status flag, the hardware mask is re-enabled.

    20. No invalid exception is raised unless a trapping NaN is involved in the operation. See section 6.2 of IEEE Std 754-1985. -- Ed.

    Table 4  Exceptions in IEEE 754*

    Exception        Result when traps disabled    Argument to trap handler
    overflow         ±∞ or ±xmax                   round(x · 2^-α)
    underflow        0, ±2^emin or denormal        round(x · 2^α)
    divide by zero   ±∞                            operands
    invalid          NaN                           operands
    inexact          round(x)                      round(x)

    * x is the exact result of the operation, α = 192 for single precision, 1536 for double, and xmax = 1.11...11 × 2^emax.


    Trap Handlers

    One obvious use for trap handlers is for backward compatibility. Old codes that expect to be aborted when exceptions occur can install a trap handler that aborts the process. This is especially useful for codes with a loop like do S until (x >