Implementation of High Speed Low Power IEEE 754 Floating Point
Addition/Subtraction Using Xilinx 12.3
Reham Murrar, Computer Engineering Department, Faculty of Computer and Information Technology,
Jordan University of Science and Technology, Irbid, Jordan
Email: [email protected]
Noor Awad, Computer Engineering Department, Faculty of Computer and Information Technology,
Jordan University of Science and Technology, Irbid, Jordan
Email: [email protected]
Abstract
This paper presents an IEEE 754 floating point addition/subtraction design. The design was
prepared by Reham Murrar and Noor Awad as a research project at Jordan University of Science
and Technology, under the supervision of Dr. Khaldoon Mhaidat. It is designed for normalized
single precision IEEE 754 floating point numbers and gives the correctly normalized sum/difference
in the IEEE 754 standard representation. Our floating point design achieves high speed, low power,
and a minimum number of slice registers by combining several optimization techniques in one
design. The adder takes any two binary numbers as inputs and checks for normalized or
denormalized numbers, positive or negative numbers, infinity, and ordinary floating point
numbers. It supports an additional input that selects the operation, add or subtract, and two
output flags, overflow and underflow. All of these features are designed using Xilinx ISE Design
Suite 12.3. At the end of this paper we take a quick look at other single precision floating point
addition algorithms from the literature and compare them with our algorithm.
Key words: IEEE 754 floating point, addition, subtraction, CSA, carry ripple adder, carry select
adder.
1. Introduction
Addition and subtraction are the most frequently used floating point operations. In this paper we
present our design for single precision floating point addition/subtraction that supports the IEEE
754 standard format. Compared with other designs, it offers significant improvements in speed,
area, and power. The design is implemented in Xilinx software, version 12.3, targeting the
Virtex-6 family, and performs single precision floating point addition/subtraction.
For these improvements, we took several optimization techniques into consideration while
building this fast design:
a) For speed and area, one enhancement comes from the synthesis tool itself, which makes the
design fast and keeps the area small. The second enhancement is in the hardware design: we
increase the use of carry save adders and apply the idea of the carry select adder, since both are
very fast, and decrease the use of carry ripple adders, since they are slow.
b) We increase the use of shifts and replace some complex operations, such as addition and
multiplication, with shifts, since shifts are much faster.
c) Post-normalization is a late step, but still an important one for obtaining correct results.
d) For power reduction, we rely on the optimizations that Xilinx offers.
e) Checking the input numbers is the first step, and it is an important one: it detects whether the
two inputs are in the allowable range of floating point numbers and whether addition or
subtraction can be performed on them.
This paper focuses on the implementation of high speed floating point addition and subtraction
operations. Implementations of different algorithms in Xilinx are investigated. Results for the
single precision implementations are analyzed and compared with other algorithms presented in
the literature. The design and its implementation are explained in more detail in the following
sections.
2. Floating Point
Addition/Subtraction Algorithm
2.1. IEEE Floating Point Standard Representation
IEEE 754 has three formats: single, double, and quad. In this paper we focus only on the first
format and make a significant enhancement in its hardware implementation, to perform fast
addition or subtraction of two binary numbers with minimum power consumption.
In the IEEE 754 single precision floating point representation, a number consists of 32 bits
divided into three regions, as shown in Figure 1:
Sign Bit: this bit indicates whether the number is positive or negative, and occupies the most
significant bit, bit 31.
Exponent Bits: 8 bits, with a range from -126 to 127, located in bits 23-30.
Fraction Bits: these bits represent the fraction of the number and occupy the lower 23 bits of
the 32-bit word.
Figure 1: IEEE 754 Floating Point Standard Representation
The standard formula for an IEEE single precision floating point number is:
value = (-1)^Sign x 1.Fraction x 2^(Exponent - 127)
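As a sketch of this formula, the decode step for a normalized number can be modeled in software (the function name and the use of Python are ours, for illustration; the paper's design is in hardware):

```python
def decode_single(bits: int) -> float:
    """Decode a normalized IEEE 754 single-precision bit pattern
    using value = (-1)^S * 1.F * 2^(E - 127)."""
    sign = (bits >> 31) & 0x1          # bit 31
    exponent = (bits >> 23) & 0xFF     # bits 23-30
    fraction = bits & 0x7FFFFF         # lower 23 bits
    mantissa = 1.0 + fraction / 2**23  # restore the implicit leading 1
    return (-1.0) ** sign * mantissa * 2.0 ** (exponent - 127)

# 0x45129200 is input A from the simulation section: 2345.125
print(decode_single(0x45129200))
```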
Before performing the addition or subtraction, we first check that each number is in range and is
not a special case such as NaN, infinity, zero, or a denormalized number. For the single
precision representation, these special numbers are detected as follows:
Positive Infinity: occurs when the exponent takes the maximum value, the fraction is zero, and
the sign is positive.
Negative Infinity: occurs when the exponent takes the maximum value, the fraction is zero, and
the sign is negative.
Positive NaN: occurs when the exponent takes the maximum value, the fraction is nonzero, and
the sign is positive.
Negative NaN: occurs when the exponent takes the maximum value, the fraction is nonzero, and
the sign is negative.
Positive Zero: occurs when the exponent takes the minimum value (zero), the fraction is zero,
and the sign is positive.
Negative Zero: occurs when the exponent takes the minimum value (zero), the fraction is zero,
and the sign is negative.
Positive Denormalized: occurs when the exponent takes the minimum value (zero), the fraction
is nonzero, and the sign is positive.
Negative Denormalized: occurs when the exponent takes the minimum value (zero), the fraction
is nonzero, and the sign is negative.
Otherwise, the number is in the normal range and any operation can be applied to it.
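The checks above can be summarized in a short behavioral sketch (Python, our illustration; the names follow the paper's flags but the function itself is hypothetical):

```python
def classify(bits: int) -> str:
    """Classify a 32-bit IEEE 754 single-precision pattern into the
    special cases listed above."""
    sign = "negative" if (bits >> 31) & 1 else "positive"
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 255:                      # maximum exponent
        kind = "infinity" if fraction == 0 else "NaN"
    elif exponent == 0:                      # minimum exponent
        kind = "zero" if fraction == 0 else "denormalized"
    else:
        kind = "normalized"                  # FPN: exponent in 1..254
    return f"{sign} {kind}"

print(classify(0x7F800000))  # positive infinity
print(classify(0x80000000))  # negative zero
```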
2.2. Bitwise Algorithm
This first design is a purely functional phase: we are not yet interested in speed or power
reduction, only in taking the inputs, producing the correct output, and comparing it with the
expected result.
In the bitwise full adder, the main module that performs this task is the FA module. It is a
combinational module built from logic gates, executed 23 times, processing 1 bit each time, as
the following two equations show (where & is AND, | is OR, and ^ is XOR):
SUM = A ^ B ^ Cin
Cout = (A & B) | (A & Cin) | (B & Cin)
where A, B, and Cin each represent a single bit of the fraction.
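A minimal software model of this scheme, assuming a simple LSB-first chaining of the FA module (our sketch, not the paper's HDL):

```python
def full_adder(a: int, b: int, cin: int):
    """One-bit FA module: SUM = A ^ B ^ Cin,
    Cout = (A & B) | (A & Cin) | (B & Cin)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_add(a: int, b: int, width: int):
    """Chain the 1-bit FA `width` times, LSB first, rippling the carry."""
    carry, total = 0, 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry
```

The carry of bit i feeds bit i+1, which is exactly the dependency chain that makes this design slow.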
2.3. Plus Operator Algorithm
This second implementation attempts a speed enhancement by changing the bitwise design into a
full-width one, using the plus (+) operator over the full word:
SUM = A + B
where A and B represent the full fraction width of 23 bits.
3. Implementation of the High Speed, Low Power Algorithm
3.1. Carry Save Adder with Carry Select Adder Algorithm
The last design we built differs entirely from the previous ones: it is a hierarchy of carry save
adders that also uses the idea of the carry select adder, with a carry ripple adder in the last
level.
This adder takes three inputs: the two binary numbers whose sum or difference we want, and a
one-bit control that determines the operation, as shown in Figure 2. It has six outputs: the
result, the last carry, two flags (overflow and underflow), and two control fields for the two
inputs (NaN, infinity, zero, etc.). The Op input bit selects the operation, add or subtract, as
shown in Table 1.
Figure 2: Floating Point Addition/Subtraction block diagram
Table 1: Operation Control Field
op Operation
0 Addition
1 Subtraction
3.1.1. The Algorithm
First, the two binary inputs are checked using the FPControl module, as shown in Figure 3, for
being negative (NegN), infinity (InfN), not a number (NaN), denormalized (DenN), zero (ZN), or an
ordinary floating point number (FPN). As shown in Table 2, an input is zero if its exponent and
fraction are both zero; it is negative if its most significant bit is 1; it is a denormalized
number if its exponent is zero while its fraction is nonzero; it is infinity when its exponent
equals 255 and its fraction equals zero; it is NaN when its exponent equals 255 and its fraction
is nonzero; and it is a normalized single precision floating point number if its exponent is
between 1 and 254. This check ensures that all input comparisons will be handled correctly, and
the operation proceeds only when the FPN signal is high for both inputs.
Figure 3: The Input Control block diagram
Table 2: IEEE 754 Encoding Of Floating Point Number
The inputs are then split into the exponent, the fraction, and the sign, which are sent to the
FPAddition module, as shown in Figure 4. Since the inputs are normalized, we define two temporary
24-bit registers, each holding a 1 bit at the MSB and the fraction in the lower bits. The two
signs, together with the operation control (op), are then checked, giving eight different cases
that determine the operation. In each case the two inputs are compared by their exponent and
fraction bits to find the larger one; the larger exponent becomes the exponent of the result, and
the smaller exponent is subtracted from the larger one to obtain the shift count by which the
smaller input is shifted right, so that both operands have the same exponent and a correct
addition can be performed.
Figure 4: FPAddition block diagram
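The compare-and-align step can be sketched behaviorally as follows (our illustration; the tuple comparison stands in for the exponent-then-fraction comparison the paper describes):

```python
def align(exp_a: int, man_a: int, exp_b: int, man_b: int):
    """Pick the larger operand (by exponent, then mantissa) and
    right-shift the smaller operand's 24-bit mantissa by the
    exponent difference, so both share the larger exponent."""
    if (exp_a, man_a) >= (exp_b, man_b):
        return exp_a, man_a, man_b >> (exp_a - exp_b)
    return exp_b, man_b, man_a >> (exp_b - exp_a)

# Example from Section 4: ExpA = 138, ExpB = 126 -> B shifted right by 12
exp, big, small = align(138, 0b100100101001001000000000,
                        126, 0b110000000000000000000000)
```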
The addition is done using a tree of carry save adders (CSA) together with carry select adders,
as shown in Figure 5. The idea is as follows: the two 24-bit fractions are split into two 12-bit
halves. The lower 12 bits are added using a 12-bit CSA that consists of two 6-bit CSAs, and each
6-bit CSA consists of two 3-bit carry propagate adders (CPA). In each stage the bits are added
twice, except in the first stage, whose cin is 0: one adder computes the sum assuming cin = 0 and
the other assuming cin = 1. A 2:1 mux then selects the correct sum according to the cout of the
previous stage. This ensures that the addition is done quickly, since all the adders work in
parallel and there is no carry propagation except within the carry propagate adder stage. The
upper 12 bits are added in the same way as the lower 12 bits.
Figure 5: design of the fast CSA, CPA and carry select adder
Object Represented   Fraction   Exponent
+-0                  0          0
+-DenN               nonzero    0
+-FPN                anything   1-254
+-InfN               0          255
NaN                  nonzero    255
Note that we set the registers to 24 bits, although the fraction size is just 23 bits, to hold
the implicit leading 1 above the fraction.
This design is fast because the operations occur in parallel; there is no dependency between the
bits except in the last level, where a carry ripple adder is used.
But how do two adjacent propagate adders, or two adjacent carry save adders (CSA), communicate
with each other? The answer is simply the idea of the carry select adder: use a 2:1 mux and set
the selection bit to the carry of the lower part, choosing between the two sums, the sum with
cin = 0 and the sum with cin = 1.
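A behavioral sketch of this carry select scheme over 12-bit slices (our simplification: a real circuit computes all slices in parallel, while this loop models them sequentially; names are ours):

```python
def cpa(a: int, b: int, cin: int, width: int):
    """Carry-propagate add of `width`-bit slices; returns (sum, cout)."""
    total = a + b + cin
    return total & ((1 << width) - 1), total >> width

def carry_select_add(a: int, b: int, width: int = 24, slice_w: int = 12):
    """Each slice above the first is added twice, once assuming cin = 0
    and once assuming cin = 1; the `if` below plays the role of the
    2:1 mux that picks the sum matching the real incoming carry."""
    mask = (1 << slice_w) - 1
    result, carry = 0, 0
    for i in range(0, width, slice_w):
        sa, sb = (a >> i) & mask, (b >> i) & mask
        sum0, c0 = cpa(sa, sb, 0, slice_w)   # speculative: cin = 0
        sum1, c1 = cpa(sa, sb, 1, slice_w)   # speculative: cin = 1
        s, carry = (sum1, c1) if carry else (sum0, c0)
        result |= s << i
    return result, carry
```

Because both speculative sums are ready before the lower carry arrives, only the mux delay is added per slice instead of a full ripple through the slice.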
As mentioned before, a carry ripple adder is used for each 3-bit slice, following these equations
(where & is AND, | is OR, and ^ is XOR):
Gi = Ai & Bi
Pi = Ai | Bi
Ci+1 = Gi | (Pi & Ci)
SUMi = Gi ^ Pi ^ Ci
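These per-bit equations can be checked exhaustively against a reference full adder; note that SUMi = Gi ^ Pi ^ Ci works with the OR form of Pi because (A | B) ^ (A & B) equals A ^ B:

```python
# Exhaustive check of the generate/propagate equations above
# for all 8 combinations of (Ai, Bi, Ci).
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            g = a & b                 # generate:  Gi = Ai & Bi
            p = a | b                 # propagate: Pi = Ai | Bi
            c_next = g | (p & c)      # Ci+1 = Gi | (Pi & Ci)
            s = g ^ p ^ c             # SUMi = Gi ^ Pi ^ Ci
            assert s == (a ^ b ^ c)           # matches sum bit
            assert c_next == ((a + b + c) >> 1)  # matches carry bit
print("all 8 cases match")
```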
After that, the final 24-bit sum and the final cout are sent to the Normalized module, as shown
in Figure 6. This module converts the sum into the same normalized single precision format as the
inputs.
Figure 6: NormalizedResult block diagram
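A hypothetical model of this post-normalization step (the paper does not give the module's internals, so the exact behavior below is our assumption):

```python
def normalize(total: int, exponent: int):
    """Post-normalize a raw sum (24-bit mantissas plus a possible
    carry-out in bit 24) back to a 24-bit mantissa with the MSB set,
    adjusting the exponent accordingly."""
    if total >> 24:                      # carry out: shift right, exp + 1
        total >>= 1
        exponent += 1
    else:                                # shift left until bit 23 is 1
        while total and not (total >> 23) & 1:
            total <<= 1
            exponent -= 1
    return total & 0xFFFFFF, exponent
```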
4. Simulation & Synthesis Results
After building each design, it is tested by attaching a testbench file to the top module,
applying input values, and inspecting the resulting signal waveforms in the ISim simulator.
We take as an example two binary numbers:
Figure 7: Input to the testbench file
A:
Sign: 0 (positive)
Exponent: (10001010)2 = (138)10
Fraction: (00100101001001000000000)2 = (1.4508056e-1)10
So the overall value of A is (2345.125)10.
B:
Sign: 0 (positive)
Exponent: (01111110)2 = (126)10
Fraction: (10000000000000000000000)2 = (5e-1)10
So the overall value of B is (0.75)10.
In this figure the operation bit is 0, which means addition of the two numbers A and B.
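These decoded values can be cross-checked against Python's own IEEE 754 handling (our side check, not part of the paper's flow):

```python
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# A: sign 0, exponent 138, fraction 00100101001001000000000 -> 2345.125
a = bits_to_float((138 << 23) | 0b00100101001001000000000)
# B: sign 0, exponent 126, fraction 10000000000000000000000 -> 0.75
b = bits_to_float((126 << 23) | 0b10000000000000000000000)
print(a, b, a + b)   # 2345.125 0.75 2345.875
```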
As mentioned before, floating point addition is not as simple as ordinary addition, because of
the exponent and sign parts.
Following the algorithm above, the two numbers A and B are first checked by the control bits;
both are valid, so the signal FPN = 1.
Then each fraction is concatenated with a 1 in the most significant bit to form a 24-bit
fraction. Case 1 applies to this test case (signA = 0, signB = 0, and op = 0).
Within case 1, the third branch applies, ExpA > ExpB, which means:
Result exponent = ExpA = (138)10
Result sign = signA = 0
B = B >> (ExpA - ExpB), so B = 0000 0000 0000 1100 0000 0000
A = 1001 0010 1001 0010 0000 0000
Then the hierarchy of carry save adders and the carry ripple adder executes, giving the expected
output:
Figure 8: result of the floating point addition of two inputs A and B
Comparing this expected result with the actual result produced by the ISim simulator confirms
that the output is correct.
Figure 9: ISim simulator shows the actual result of the previous addition
The design summary reports provide the information we are interested in, such as the number of
slices, the best- and worst-case timing, and the power reports.
By comparing the results of the three designs, we obtain the following figure, which shows the
statistics:
Figure 10: comparison of our three designs in area, speed and power consumption
5. Related Work
Our adder was synthesized using Xilinx ISE Design Suite 12.3 for the Spartan-6 family, device
XC6SLX45. Our adder occupies 67 slice registers. Its worst-case delay equals 0.14 ns with the
normal optimization option, 0.139 ns with the high option, and 0.133 ns with the fast option. The
on-chip power of our adder equals 1007.33 mW.
Comparing our design with other designs reported in IEEE publications, we conclude that our
algorithm gives the best speed and power consumption among them, as shown in the following
figure, which compares several adders from different sources with our fast design.
Figure 11: comparison between our design and other designs from different sources
References
[1]: Liang-Kai Wang, M. J. Schulte, J. D. Thompson, and N. Jairam, "Hardware Designs for Decimal
Floating-Point Addition and Related Operations", IEEE Transactions on Computers, vol. 58, no. 3,
March 2009.
[2]: Shao Jie, Ye Ning, and Zhang Xiao-Yan, "The Implementation of High-speed Floating-point DBF
Based on FPGA", International Journal of Digital Content Technology and its Applications, vol. 5,
no. 7, July 2011.
[3]: M. D. Ercegovac and T. Lang, Digital Arithmetic. San Francisco: Morgan Kaufmann, 2004. ISBN
1-55860-798-6.
[4]: Subhash Kumar Shrama, Himanshu Pandey, Shailendra Sahni, and Vishal Kumar Srivastava,
"Implementation of IEEE-754 Addition and Subtraction for Floating Point Arithmetic Logic Unit",
International Transactions in Mathematical Sciences and Computer, vol. 3, no. 1, 2010,
pp. 131-140. ISSN (Print) 0974-5068.
[5]: W.-C. Park, S.-W. Lee, O.-Y. Kwon, T.-D. Han, and S.-D. Kim, "Floating point
adder/subtractor performing IEEE rounding and addition/subtraction in parallel", IEICE
Transactions on Information and Systems, E79-D(4):297-305, April 1996.