ABSTRACT
GUPTE, RUCHIR. Interval Arithmetic Logic Unit for DSP and Control Applications. (Under the direction of Prof. William W. Edmonson).
There are many applications in the field of digital signal processing (DSP) and controls that require the user to know how various numerical errors (uncertainty) affect the result. Interval Arithmetic (IA) bounds this uncertainty by replacing non-interval values with intervals. Since most DSPs operate in real-time environments, fast processors are needed. The goal is to develop a platform in which interval arithmetic operations are performed at the same computational speed as present-day signal processors.
This thesis proposes a design for an interval-based arithmetic logic unit (I-ALU) whose computational time for implementing interval arithmetic operations is equivalent to many digital signal processors. Many DSP and control applications require a small subset of arithmetic operations that must be computed efficiently. This design has two independent modules operating in parallel to calculate the lower bound and upper bound of the output interval. The functional unit of the ALU performs the basic fixed-point interval arithmetic operations of addition, subtraction, and multiplication, and the interval set operations of union and intersection. In addition, the ALU is optimized to perform dot products through the multiply-accumulate instruction. Traditionally, division is not implemented on digital signal processors unless it can be computed with a shift operation; in this design, division by shifting is implemented.
One of the prime design goals is to maximize the throughput of the ALU for an optimum value of area. Pipelining is implemented to achieve this design goal. Power dissipation analysis of different ALU architectures is performed. Since the design is required to obtain maximum throughput for the least power dissipation, throughput per unit power dissipation is used as the most critical performance metric. This thesis studies several architectures for the ALU and concludes with the one exhibiting the highest performance among those studied.
Interval Arithmetic Logic Unit for DSP and Control Applications
by
Ruchir Gupte
A thesis submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Master of Science
Electrical and Computer Engineering
Raleigh
2006
Approved By:
Dr. Winser E. Alexander Dr. William Rhett Davis
Dr. William W. Edmonson
Chair of Advisory Committee
To
Mom, Dad and Sheetal
Biography
Ruchir Gupte was born on December 9, 1982 in Mumbai, India. He received his Bachelor of Engineering (B.E.) degree in Electronics and Telecommunications from Mumbai University in June 2004. In Fall 2004, he began his graduate
studies in the Electrical and Computer Engineering Department at North Carolina
State University. Since Spring 2005, he has been working in the High Performance
Digital Signal Processing (HiPerDSP) Laboratory of Dr. Winser Alexander and Dr.
William Edmonson in the field of hardware support for interval analysis.
He worked at Sony Ericsson Mobile Communications Inc., Raleigh, as an intern from May 2005 to August 2005. He has also taken a keen interest in community participation and has been a committee member of MAITRI, the NC State Indian Graduate Student Association. Moreover, he has volunteered for various events organized by the Department of Electrical and Computer Engineering at North Carolina State University.
Acknowledgements
Above all, I thank my parents for the much needed motivation throughout the
duration of my stay away from home. It was their love and support that helped
me maintain sanity during stressful times. My sister, Sheetal, has been a great
inspiration for me throughout.
I sincerely acknowledge the efforts of Dr. William Edmonson, my academic advi-
sor, in providing guidance and encouragement for the successful completion of this
thesis. Dr. Edmonson has made available all resources that I could possibly need
and also allowed the independence of applying my ideas in this project. I am deeply
indebted to him for his patience and invaluable suggestions during the course of this
project.
I am also grateful to the other members of my thesis committee, Dr. Winser
Alexander and Dr. Rhett Davis for devoting their time and providing useful inputs.
Completion of this work would not have been possible without their guidance.
I sincerely wish to express my gratitude to the HiPer DSP Research group for
creating an environment that has been fabulous for research and fun. Additional
thanks to Ramsey Hourani and Senanu Ocloo for their unconditional help through-
out my stay in the group. The encouragement and moral support extended by all
members of the group through good and hard times cannot be described in words.
Special thanks to Ravi Jenkal for his inputs and help.
Finally, I would like to thank those near and dear to me, without whose backing this thesis would have remained a distant reality. I am grateful to all my friends in Raleigh for being there, with a special mention to my roommate, Karan Tewari, for his steady support and friendship.
Contents
List of Tables . . . vii
List of Figures . . . viii

1 Introduction . . . 1
1.1 Motivation . . . 4
1.2 Background . . . 5
1.3 Contribution . . . 7
1.4 Thesis Organization . . . 8

2 Interval Arithmetic . . . 9
2.1 Interval Arithmetic and Set Operations . . . 10

3 Design Specifications of the ALU . . . 16
3.1 Fixed-point two’s complement arithmetic . . . 16
3.1.1 Representation of numbers . . . 17
3.1.2 Arithmetic Operations . . . 18
3.2 Outward/Directed Rounding . . . 21

4 Hardware Architecture . . . 25
4.1 Overall Architecture . . . 25
4.2 Flag Generator Module . . . 27
4.3 Lower Bound and Upper Bound Modules . . . 28
4.3.1 Functional Units and Control Logic . . . 31
4.3.2 Special Case Multiplication Block . . . 32
4.3.3 Multiply-Accumulate Block . . . 34
4.4 Rounding Unit . . . 35
4.5 Pipeline Architecture of the Design . . . 38
4.5.1 Need for Pipelining . . . 38
4.5.2 Partially Pipelined Design . . . 39
4.5.3 Highly Pipelined Design . . . 41

5 Testing and Results . . . 44
5.1 Simulation Results . . . 45
5.2 Synthesis Results . . . 45
5.2.1 Non-Pipelined Architecture . . . 47
5.2.2 Design with Pipeline Multipliers . . . 49
5.2.3 Highly Pipelined Design . . . 54
5.3 Power Analysis . . . 57
5.3.1 Generating Input Vectors . . . 59
5.3.2 Statistical Results from Power Scripts . . . 60

6 Conclusions and Future Work . . . 66
6.1 Conclusions . . . 66
6.2 Future Work . . . 68

Bibliography . . . 70
List of Tables

2.1 Nine Cases in Multiplication . . . 12
3.1 Two’s complement fixed-point representation . . . 18
4.1 Description of ALU Inputs . . . 27
4.2 Description of ALU Outputs . . . 27
4.3 mul flag for the Multiplication operation . . . 28
4.4 Flag Generation . . . 29
5.1 Timing Reports for the Non-Pipelined Architecture . . . 47
5.2 Area Reports of Non-Pipelined Architecture . . . 49
5.3 Timing Reports for the Pipelined Multipliers . . . 51
5.4 Timing Reports for various Pipelined Architectures . . . 51
5.5 Area Reports for various Pipelined Architectures . . . 52
5.6 Timing Reports for Non-Pipelined and Highly-Pipelined Architectures . . . 56
5.7 Results for Non-Pipelined and Highly-Pipelined Architectures . . . 56
5.8 Area Reports for Non-Pipelined and Highly-Pipelined Architectures . . . 56
5.9 Power Dissipation for Different Architectures with 500 Input Vectors . . . 60
5.10 Power Dissipation for 3-stage Pipelined Architecture . . . 62
5.11 Power Dissipation for All Stages with different Input Vectors . . . 63
5.12 Throughput per unit Power Dissipation for All Architectures . . . 63
6.1 Comparison of Non-Pipelined and Highly-Pipelined Designs . . . 68

List of Figures

3.1 Two’s complement number representation . . . 17
4.1 Top Level Block Diagram of the ALU . . . 26
4.2 Flag Generation Module . . . 29
4.3 Block Diagram of Lower Bound Module . . . 30
4.4 Block Diagram of Upper Bound Module . . . 30
4.5 Lower Bound Module . . . 32
4.6 Upper Bound Module . . . 33
4.7 Special Case Multiplication . . . 34
4.8 Multiply-Accumulate Module . . . 35
4.9 Lower Bound Rounding . . . 36
4.10 Upper Bound Rounding . . . 37
4.11 Critical Path . . . 39
4.12 Non-Pipelined Multiplier Architecture . . . 40
4.13 Two-stage Pipelined Multiplier Architecture . . . 40
4.14 Three-stage Pipelined Multiplier Architecture . . . 40
4.15 Four-stage Pipelined Multiplier Architecture . . . 41
4.16 Five-stage Pipelined Multiplier Architecture . . . 41
4.17 Highly Pipelined Architecture . . . 42
5.1 Simulation Results for Add, Subtract and Multiply . . . 46
5.2 Simulation Results for Interval Union and Intersection . . . 46
5.3 Timing Reports of Non-Pipelined Architecture . . . 48
5.4 Area Reports for the Non-Pipelined Architecture . . . 50
5.5 Timing Reports of Different Pipelined Architectures . . . 53
5.6 Area Reports of Different Pipelined Architectures . . . 54
5.7 Timing and Area Reports of Different Pipelined Architectures . . . 55
5.8 Timing Reports of Non-Pipelined and Highly-Pipelined Architectures . . . 57
5.9 Area Reports of Non-Pipelined and Highly-Pipelined Architectures . . . 58
5.10 Power Dissipation for 500 Input Vectors . . . 61
5.11 Power Dissipation for different Input Vectors for 3-stage Pipelined Multiplier Design . . . 62
5.12 Power Dissipation for different Input Vectors for All Architectures . . . 64
5.13 Throughput per unit Power Dissipation for All Architectures . . . 65
Chapter 1
Introduction
Interval Arithmetic (IA) performs computations on intervals of real numbers
instead of real numbers themselves. It takes into account the numerical errors that
occur due to performing arithmetic on a computer. This is a problem that occurs
on all computers that make use of binary number systems, such as the IEEE 754
Standard for Binary Floating-Point Number Systems [1]. As of now, implementa-
tion of interval arithmetic is performed in software. The GNU Fortran Compiler
has been modified to provide support for an interval data type [2], based on the
Interval Arithmetic Specification [3]. The SUN Studio Fortran95 compiler provides
support for interval operations [4]. The SUN studio has compilers and tools that
support C and C++ development as well [5]. The main disadvantage of software implementations is their slow speed: they incur tremendous overhead from changing rounding modes, function calls, exception handling, memory management, and so on. For instance, the operation of multiplication requires several conditional branches to determine which interval end-points need to be multiplied.
Based on the values of the input intervals relative to zero, nine different cases of
multiplication have to be accounted for to select the end-points. A large number
of conditional statements are required to select between these multiplication cases.
The extra work due to individual operations being done sequentially requires several
time consuming steps. The performance penalty to be paid for misprediction of con-
ditional branches is quite heavy in the case of fully pipelined processors. Changing
of rounding modes in software also requires a large number of computational cycles.
On many processors, changing the rounding mode causes the entire floating-point
pipeline to be flushed, which results in a delay of several cycles and severely limits
parallel execution. Furthermore, software implementations of interval multiplica-
tion are typically implemented as subroutines, which adds overhead for subroutine
calls and returns.
Thus, interval algorithms end up running slower on current computer architec-
tures compared to their real arithmetic counterparts [6]. Software implementations
are found to be as much as four times slower than functionally equivalent hardware
[7]. Hardware support is required to overcome these performance drops caused by
the above software issues.
Applications of digital signal processing (DSP) involve a very large number of
arithmetic operations, and the necessity of obtaining accurate results makes it im-
perative to perform reliable numerical computations. The goal of this design is to
solve problems in the DSP field with higher accuracy and at a faster rate. Since
software implementations are slow, it is necessary to build dedicated hardware in or-
der to achieve this goal. Interval methods form one of the solutions to reduce errors
resulting from numerical computations. This was the motivation behind building
an Interval ALU dedicated to DSP and Control applications.
Interval-based algorithms continue to find applications as solutions to signal processing and control problems. For instance, in signal processing there is usually the need to determine the optimal solution to a problem, i.e., to minimize a cost function. The ability of interval global optimization approaches to guarantee convergence to the global minimum point(s) is what makes such approaches attractive in DSP and control applications. Optimized hardware for global optimization could reduce some of the overhead associated with the complete software solution
mentioned earlier. DSP and control algorithms need to be designed in such a way
that roundoff and truncation errors that occur naturally due to the discrete nature
of computing do not prevent the algorithm from converging to the global minimum.
Interval analysis provides a means of managing such errors. It is therefore possible
to obtain numerically accurate and reliable results. Reliable results may be defined as solutions in which the obtained value is guaranteed to bound the exact result of the operation being performed. Interval analysis gives as its output an interval that is certain to contain the exact value of the result expected from an operation performed on two input interval numbers. It is thus capable of providing reliable results.
These results can be achieved by using arithmetic logic units (ALU) that are
specially designed to manipulate interval numbers. Such an Interval ALU (I-ALU)
can be used as the core of a digital signal processor. In contrast to general pur-
pose microprocessors that are designed to handle general computing tasks, digital
signal processors are designed and optimized to operate on algorithms that are
characterized by repetitive multiply-and-add operations. In general, they feature
fast multiply-accumulate instructions, multiple-access memory, specialized program
control for interrupt handling and I/O, and fast and efficient access to peripherals.
We desire to achieve maximum efficiency while providing these features that make
up a good digital signal processor.
Throughput is the most important metric to analyze the performance of a DSP
system. The throughput problem will have to be solved for interval algorithms to
become more practical. The throughput of an I-ALU will have to be comparable to
that of non-interval units. Pipelining provides an effective solution to improving the
throughput of the design. By definition, pipelining is an implementation technique
where multiple instructions are overlapped in execution. Each stage completes a
part of an instruction in parallel. Pipelining does not decrease the time for individual
instruction execution. It increases instruction throughput, instead. Throughput of
a system is directly related to the depth of the pipeline. However, increasing the
depth of the pipeline adversely affects the area of the design. Hence an optimum
design would involve a proper trade-off between the throughput and area, where
the throughput would have more importance in signal processing applications.
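The throughput-versus-depth trade-off described above can be illustrated with a first-order model. The following Python sketch uses assumed numbers (a 10 ns combinational path and 0.5 ns of register overhead per stage, both hypothetical, not figures from this design) to show why deeper pipelines raise throughput while register overhead gives diminishing returns:

```python
# First-order pipeline model; the delay figures are illustrative assumptions,
# not timing results from this thesis.
def pipelined_throughput(total_delay_ns, stages, reg_overhead_ns=0.5):
    """Throughput in operations/ns for an ideally balanced pipeline."""
    cycle_ns = total_delay_ns / stages + reg_overhead_ns  # stage delay + register
    return 1.0 / cycle_ns

# Throughput rises with pipeline depth, with diminishing returns.
for depth in (1, 2, 3, 4, 5):
    print(depth, round(pipelined_throughput(10.0, depth), 3))
```

The model ignores stage imbalance and hazards, but it captures the basic relationship: throughput grows with depth while the cycle time floors out at the register overhead.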
1.1 Motivation
Digital signal processing has become the choice for many applications related to communications, control, multimedia, etc., because of the high performance it achieves for applications that involve a limited instruction set implementing repetitive linear operations such as addition, multiplication, and delay on a stream of sampled data. Often a DSP has been used as an attached coprocessor or combined with one or more FPGA devices to meet the performance and cost requirements of a particular application. Common to DSPs is the ability to perform multiply-accumulate (MAC) operations in a single instruction cycle. This operation is key to computing vector products, which in turn are central to computing Fourier transforms and correlations. Other features of a DSP include the ability to access multiple memories, dedicated address units for simultaneous access to data memory and program memory, and modulo addressing. Several DSP manufacturers also include specialized peripherals along with fast interrupt handling. Examples of these specialized peripherals include analog-to-digital converters and I/O for multiprocessor communications.
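In software terms, the MAC-based vector product reduces to the following sketch (illustrative Python, not the hardware implementation):

```python
def dot_product(x, y):
    """Vector product as repeated multiply-accumulate: acc += x[i] * y[i]."""
    acc = 0
    for xi, yi in zip(x, y):
        acc += xi * yi  # one MAC; a DSP issues this in a single instruction cycle
    return acc

# A 4-tap correlation step.
print(dot_product([1, 2, 3, 4], [4, 3, 2, 1]))  # 20
```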
Underlying many of these applications is the need for accurate and reliable results, but errors due to rounding, uncertainty of the data, quantization noise, and catastrophic cancellation in floating-point computations can lead to inaccuracies.
Sometimes these inaccuracies can go unnoticed. For many applications in signal
processing, operations are recursive and act on a sequence of data. This implies
that numerical errors can grow unbounded over time. An efficient method for mon-
itoring and controlling these inaccuracies is to replace point arithmetic with interval
arithmetic. Interval methods are capable of bounding these inaccuracies.
Digital signal processors represent one of the fastest growing segments of the
embedded world. Despite their vast use, DSPs present difficult challenges for pro-
grammers. Since computation speed is of critical importance to DSP applications,
DSPs focus on supporting fixed-point operations.
The use of fixed-point representation not only requires programmers to deal with mathematically sophisticated scaling techniques, but also forces them to deal with errors introduced by the use of reduced-precision arithmetic. Although it would be ideal to use floating-point arithmetic over fixed point, as a practical consideration, fixed-point processors operate at a much faster rate than their floating-point counterparts. Fixed-point DSPs execute in the gigahertz range; floating-point DSPs peak out in the 300-400 megahertz range. Fixed-point DSPs enjoy the additional advantage of being consumed in large volumes, as a result of which their price per chip is a fraction of that of a floating-point DSP [8]. Fixed-point processors gain speed and power efficiency over floating-point processors at the cost of reduced precision. However, DSP applications rarely require the full dynamic range offered by the floating-point number system. This justifies the choice of fixed-point arithmetic over floating-point arithmetic for our ALU design.
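As an illustration of the reduced precision of fixed-point arithmetic, the following Python sketch models a two's complement Q15 format; the 16-bit word length is an assumption for illustration only, and the actual representation used by the ALU is specified in Chapter 3:

```python
FRAC_BITS = 15              # assumed Q15 format: 1 sign bit, 15 fraction bits
SCALE = 1 << FRAC_BITS      # 32768

def to_fixed(x):
    """Quantize a real value to a 16-bit two's complement Q15 integer."""
    n = int(round(x * SCALE))
    return max(-SCALE, min(SCALE - 1, n))   # saturate to [-1, 1 - 2**-15]

def to_float(n):
    """Recover the real value represented by a Q15 integer."""
    return n / SCALE

x = 0.3
q = to_fixed(x)
# The quantization error is bounded by one least-significant bit.
print(q, to_float(q), abs(x - to_float(q)) <= 1.0 / SCALE)
```

The bounded quantization error shown here is exactly the kind of uncertainty that interval arithmetic is designed to track explicitly.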
As mentioned earlier, hardware solutions are needed in place of software implementations to solve the speed problem. This brings up the idea of building an arithmetic logic unit dedicated to performing arithmetic operations on interval inputs in fixed-point representation. The aim is to develop hardware that is optimized for interval operations pertinent to signal processing applications, such as addition, subtraction, multiplication, and multiply-accumulate.
1.2 Background
Interval algorithms have found their usage in applications such as global opti-
mization [9], [10], [11], function evaluation [12], finding roots of polynomials [13],
solving differential equations [14], solving non-linear equations [15] et al. In most of
these cases, interval arithmetic is used to solve problems which cannot be efficiently
solved using conventional floating-point arithmetic.
Several software tools have been developed to support interval arithmetic. These include interval arithmetic libraries in Fortran [16], [17], [18] and C++ [19], extended scientific programming languages such as PASCAL [20], C++ [21], and Fortran [22], [23], and interval-enhanced compilers [24]. Despite these developments in the field, interval arithmetic has not gained popularity owing to its speed issues when compared to conventional floating-point methods. It is believed that the performance of interval arithmetic needs to be within a factor of five of floating-point arithmetic for it to gain general acceptance [25]. Hardware support for interval arithmetic is required to achieve this high performance. Several interval-based hardware designs have been implemented, a few of which are listed below:
• Hardware Interval Multipliers [26]
The author presents serial and parallel hardware units for interval multiplica-
tion. These units provide automatic interval end-point selection and correct
rounding of results. While the serial interval multiplier uses a single multiplier
unit, the parallel multiplier uses dual multipliers to compute the interval end-
points simultaneously. These multipliers provide a significant performance
boost for acceptable increases in area.
• A Combined Interval and Floating Point Multiplier [27]
This design is based on the approach that an interval multiplier can share hardware with an existing floating-point multiplier, thereby achieving the performance benefits of an interval multiplier at relatively low cost. The design resorts to software solutions for the uncommon case of multiplication where both end-points contain zero. Interval multiplication requires only one more cycle than floating-point multiplication, and is one to two orders of magnitude faster than software implementations of interval multiplication.
• A Combined Interval and Floating Point Divider [28]
This design follows a similar approach as above, where an existing floating
point divider is modified to enable interval division on it. Based on the values
of interval inputs relative to zero, seven different cases of division are ad-
dressed. Interval division can be performed after modifying the floating point
divider with a 24% increase in area.
• A Combined Interval and Floating Point Comparator [29]
This design is an implementation of a combined interval and floating-point
comparator, which performs interval intersection, hull, minimum, maximum
and comparisons, as well as floating-point minimum, maximum and compar-
isons. It has around 98% more area than conventional floating-point compara-
tors and a worst case delay that is 42% greater.
• Variable Precision Interval Arithmetic Processors [30]
The author presents designs, arithmetic algorithms and software support for
a family of variable precision, interval arithmetic processors. These processors
give the programmer the ability to detect, and if desired, to correct the implicit
errors in finite precision numerical computations. The processors are two to
three orders of magnitude faster than software packages that provide similar
functionality.
However, all of the above architectures are designed for floating-point represen-
tation of numbers. Although these are high-precision computational units, they
have lower throughput than their potential fixed-point counterparts. As mentioned
earlier, they also have higher design complexity and hence are undesirable for DSP
applications.
1.3 Contribution
This thesis presents the hardware architecture of an I-ALU and optimizes it to achieve maximum efficiency. The whole architecture is based on fixed-point
two’s complement representation of numbers. Two’s complement representation is the most convenient for performing arithmetic because of its uniformity over positive and negative numbers during operations and rounding. Although fixed-point arithmetic reduces the precision of results, the precision it provides is sufficient for DSP-related applications. Besides, it has the added advantages of reduced design complexity and higher throughput. A basic hardware model has been built at the RTL level of abstraction, and the design has been modified for better efficiency by the use of pipelined multipliers of increasing depths. These designs have been explored, and statistical data based on the results of simulation and synthesis has been used to determine the optimal solution. Throughput, area, power dissipation, and numerical reliability are the performance metrics used for system evaluation.
1.4 Thesis Organization
Chapter 2 introduces the reader to the concept and conventional representation of interval numbers. The various arithmetic and set operations that the I-ALU can perform on these numbers are discussed in detail in this chapter. Chapter 3 provides the design specifications of the proposed ALU; the significance of using two’s complement representation of numbers can be seen here, along with the details of the rounding issue. Chapter 4 describes in detail the hardware architecture of the ALU. A comprehensive description of each module constituting the ALU is given here, and the issue of rounding is addressed. Chapter 5 provides the results of simulation runs and synthesis performed on different architectures of the design. An exhaustive comparison of the results from the non-pipelined design and various versions of the pipelined design is made to arrive at an optimal solution, with throughput, area, and power dissipation used as the performance metrics. Finally, Chapter 6 concludes the work and details the future work that can be done on the design to broaden its scope and improve its functionality and efficiency.
Chapter 2
Interval Arithmetic
In the words of Ramon E. Moore [31], “If we have, in addition to the results of
a computation, error bounds for the differences between the results and the exact
solution values, then no matter how these error values were obtained, by analytical
means or by further machine computations during or after the given computation,
it will always be the case that we have, in effect, for each exact result sought, a pair
of numbers: an approximate value and an error bound, or an upper and a lower
bound to the exact result.”
Real numbers can have infinite precision, whereas all machines are inherently finite-precision. Owing to this, real numbers are approximated to bring them to machine-representable forms, and the resulting error bound may be considered an uncertainty. Interval analysis is a means of representing this uncertainty by replacing single point values with intervals. It provides a means of bounding the errors that accrue due to the discrete nature of computing.
An interval number is defined to be an ordered pair of real numbers [a, b], such
that a ≤ b. Using the notation {x |P(x)} for “the set of x such that the proposition
P(x) holds,” we can write
[a, b] = {x | a ≤ x ≤ b}, where x ∈ ℝ
Using this convention, real numbers are represented as intervals with identical upper and lower bounds. Such intervals are called “degenerate intervals” and have the form [a, a]. The usual operations of addition, subtraction, multiplication, and division that are possible with real numbers are also defined for interval numbers [32].
An interval number is also a set of real numbers. The interval number [a, b]
is a set of real numbers x such that a ≤ x ≤ b. Hence, set operations of union
and intersection can also be done on interval numbers. Section 2.1 discusses in
depth, the various arithmetic and set operations performed by the proposed ALU.
In interval arithmetic, the true result is guaranteed to lie within the resulting interval. This is achieved by the outward rounding algorithm. Outward rounding on an interval X = [a, b] rounds the lower bound, a, down to the largest machine-representable number not greater than a, and the upper bound, b, up to the smallest machine-representable number not less than b. This involves the use of the round-down and round-up modes on the lower and upper bounds, respectively. Directed rounding capabilities, that is, the ability to round down or round up, have been available since the Intel 8087 chip. As a result, interval arithmetic is possible on virtually any computer.
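For a fixed-point machine, the round-down and round-up modes amount to flooring and ceiling at the machine's resolution. A minimal Python sketch follows; the 2^-15 resolution is an illustrative assumption, not the ALU's actual word length:

```python
import math

LSB = 2.0 ** -15   # assumed machine resolution: one fixed-point LSB

def round_outward(a, b):
    """Round [a, b] outward: lower bound down, upper bound up."""
    lo = math.floor(a / LSB) * LSB   # largest representable value not greater than a
    hi = math.ceil(b / LSB) * LSB    # smallest representable value not less than b
    return lo, hi

lo, hi = round_outward(0.1, 0.2)
print(lo <= 0.1 <= 0.2 <= hi)  # True: the exact interval is always enclosed
```

Note that outward rounding can only widen an interval, never narrow it, which is what preserves the containment guarantee.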
2.1 Interval Arithmetic and Set Operations
As described in the previous section, interval numbers are represented by an
ordered pair [a, b] such that a ≤ b. The arithmetic interval operations of addition,
subtraction and multiplication will be discussed in this section. The rules for ef-
ficiently implementing these operations are listed here. In addition, the rules for
the set operations of union and intersection, along with the calculations of the width and mid-point of a single interval, are also described.
In the following discussion, we consider two input interval numbers. They are
represented as [xL, xU ] and [yL, yU ], where xL and yL are the lower bounds and xU
and yU are the upper bounds of the two intervals. Except for one special case of set
union of two disjoint sets, all operations result in a single output interval number.
The outputs of various interval operations are obtained as follows:
• Addition
Addition of interval numbers is a straightforward operation where the lower
bound of the output interval is obtained from the sum of the lower bounds of
the input intervals, while the upper bound of the output interval is obtained
from the sum of the upper bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU ] + [yL, yU ] = [xL + yL, xU + yU ]
• Subtraction
In subtraction, the lower bound of the output interval is obtained by subtracting the upper bound of the second interval number from the lower bound of the first interval number. Similarly, the upper bound of the output interval is obtained by subtracting the lower bound of the second interval number from the upper bound of the first interval number.
Mathematically, this can be represented as:
[xL, xU ] - [yL, yU ] = [xL − yU , xU − yL]
Table 2.1: Nine Cases in Multiplication
Case Condition Result
1 xL ≥ 0; yL ≥ 0 [xLyL,xUyU ]
2 xL ≥ 0; yL < 0 < yU [xUyL,xUyU ]
3 xL ≥ 0; yU ≤ 0 [xUyL,xLyU ]
4 xL < 0 < xU ; yL ≥ 0 [xLyU ,xUyU ]
5 xL < 0 < xU ; yU ≤ 0 [xUyL,xLyL]
6 xU ≤ 0; yL ≥ 0 [xLyU ,xUyL]
7 xU ≤ 0; yL < 0 < yU [xLyU ,xLyL]
8 xU ≤ 0; yU ≤ 0 [xUyU ,xLyL]
9 xL < 0 < xU ;yL < 0 < yU [min(xUyL,xLyU),
max(xLyL,xUyU)]
• Multiplication
Multiplication presents a more difficult problem than addition and subtraction.
Unlike those two operations, the signs of the operands, and not just their
magnitudes, must be taken into consideration: together they determine which two
values are multiplied to obtain the lower and upper bounds. In the general case,
the result of multiplying two input intervals would be obtained as follows:
If [xL, xU ] ∗ [yL, yU ] = [zL, zU ], then,
zL = min(xLyL, xLyU , xUyL, xUyU) and
zU = max(xLyL, xLyU , xUyL, xUyU)
These computations require eight multiplications and several comparisons
to be performed before the lower and upper bounds of the intervals can be
obtained. This makes the multiplication operation highly inefficient. To overcome
this problem, the multiplication operation is split into nine cases based on the
signs of the operands with respect to zero. Table 2.1 lists these nine cases and
gives the output interval for each of them.
From this table, it can be observed that obtaining the lower bound and the upper
bound of the output interval now requires only two multiplications, compared to
the eight multiplications of the brute force method. Comparisons of the input
values must be performed first to determine which case applies. This reduction
in the number of multiplications makes the design more hardware efficient.
A special mention needs to be made of case 9, where both the input intervals
include zero in them. Although this would be a rare case in high resolution
processors, it needs to be addressed for the purpose of numerical reliability. As
can be seen from the table, the output for this case requires 4 multiplications
and 2 comparisons to be performed. This case leads to increased complexity
in the design and from the hardware point of view requires double the amount
of computational time as compared to other operations.
• Union of Interval Numbers
Union of interval numbers is done in the same way as the union operation
in set theory. By definition, for two sets A and B, (A ∪ B) is defined as a
set containing all elements of set A and all elements of set B. Similarly for
interval numbers, to perform the union operation, the lower bound is obtained
by determining the minimum value of the lower bounds of the two input
intervals while the upper bound is obtained by determining the maximum
value of the upper bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU ] ∪ [yL, yU ] = [min(xL, yL), max(xU , yU)]
For interval numbers, the union of two disjoint sets has to be dealt with
separately. This case results in two output intervals instead of one, the output
intervals being exactly equal to the two input intervals. Amongst all operations
performed by the ALU, this is the only operation which results in two output
intervals.
Mathematically, this can be represented as:
For two disjoint intervals, [xL, xU ] and [yL, yU ],
[xL, xU ] ∪ [yL, yU ] = {[xL, xU ], [yL, yU ]}
• Intersection of Interval Numbers
Intersection of interval numbers is done in the same way as the intersection
operation in set theory. By definition, for two sets A and B, (A ∩ B) is
defined as a set containing only those elements that belong to set A and to set
B. For the intersection operation, the lower bound is obtained by determining
the maximum value of the lower bounds of the two input intervals while the
upper bound is obtained by determining the minimum value of the upper
bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU ] ∩ [yL, yU ] = [max(xL, yL), min(xU , yU)]
A null set is obtained for the intersection of two disjoint sets.
• Width
The “width” operation is performed on a single interval. Width of an interval
is defined as the difference between the upper bound and lower bound of the
interval. The output is naturally a single value.
Mathematically, this can be represented as:
width[xL, xU ] = xU − xL
• Mid-point
The “mid-point” operation is also performed on a single interval. Mid-point
of an interval is obtained by taking the average of the lower bound and upper
bound of the input interval. Once again, the output is a single value.
Mathematically, this can be represented as:
midpoint [xL, xU ] = (xU + xL)/2
Division by 2 is performed by right shifting the sum of the two bounds of the
interval by one bit. Because the sum can overflow into one extra bit, the operands
are first sign extended by one bit so that the shift preserves the sign.
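The interval operations above can be summarized in a short software model. This is an illustrative Python sketch, not the thesis hardware: intervals are (lo, hi) tuples, rounding is ignored, and for the eight non-special multiplication cases a brute-force min/max over the four products is used, which yields the same interval that the hardware obtains with the two products selected per Table 2.1.

```python
# Behavioral model of the interval operations; names (i_add, i_mul, ...)
# are illustrative, not taken from the thesis design.

def i_add(x, y):
    """[xL,xU] + [yL,yU] = [xL+yL, xU+yU]"""
    return (x[0] + y[0], x[1] + y[1])

def i_sub(x, y):
    """[xL,xU] - [yL,yU] = [xL-yU, xU-yL]"""
    return (x[0] - y[1], x[1] - y[0])

def i_mul(x, y):
    """Interval multiplication; case 9 of Table 2.1 handled explicitly."""
    xL, xU = x
    yL, yU = y
    if xL < 0 < xU and yL < 0 < yU:              # case 9: both straddle zero
        return (min(xU * yL, xL * yU), max(xL * yL, xU * yU))
    p = (xL * yL, xL * yU, xU * yL, xU * yU)     # other 8 cases: brute force
    return (min(p), max(p))                      # matches the table's 2 products

def i_union(x, y):
    """Union; disjoint inputs yield both intervals unchanged."""
    if x[1] < y[0] or y[1] < x[0]:               # disjoint: two output intervals
        return (x, y)
    return (min(x[0], y[0]), max(x[1], y[1]))

def i_intersect(x, y):
    """Intersection; None models the empty-set (disjoint) flag."""
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo <= hi else None

def width(x):
    return x[1] - x[0]

def midpoint(x):
    return (x[0] + x[1]) / 2                     # hardware: shift right by 1
```

For example, i_mul((-1, 2), (-3, 4)) exercises the special case and returns (-6, 8).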
The operations described above will be implemented on the proposed I-ALU.
Unique to this work will be the fact that these operations in conjunction with
directed rounding will be based on fixed point arithmetic.
Chapter 3
Design Specifications of the ALU
The ALU is based on a parallel architecture in which the lower bound and the
upper bound of the output interval are computed simultaneously. The
design is built for fixed point operation using the two’s complement representation
of numbers. Fixed-point two’s complement interval arithmetic and rounding are
described in detail in this section.
3.1 Fixed-point two’s complement arithmetic
The main focus of this design is to build a fixed-point interval arithmetic and
logic unit, in contrast to the floating-point interval units that have been designed
previously and discussed briefly in section 1.2. To this end, it is important to be
familiar with the operations performed on fixed-point numbers. This section is an
in-depth study of fixed-point arithmetic. It explains the functionality of the three
basic operations, viz. addition, subtraction and multiplication, performed on
fixed-point numbers. Given that our design is oriented towards DSP related
applications, division is not performed in hardware; division by powers of 2 is
performed with the shift operation. With a proper understanding of fixed-point
arithmetic, it is easy to model our hardware for the application in which the ALU
is going to be used. Whether the ALU operates on real numbers or on intervals,
the logic behind the arithmetic being performed remains the same.
As the most general case, the two's complement format is used to represent the
fixed-point numbers, since it accounts for operations performed on both positive
and negative numbers.
3.1.1 Representation of numbers
In the binary number system, an N-bit word represents integer values from 0
to 2^N − 1. This is referred to as the unsigned integer representation. The
fixed-point representation has a predefined position for the radix point, which
means that a fixed number of bits is reserved for the integer part and a fixed
number for the fractional part. A 32-bit number having 16 bits reserved for the
integer part and 16 for the fractional part is written as 16:16. However, this mode
lacks the ability to represent negative numbers.
The two's complement method of representing fixed-point numbers accounts for
both positive and negative numbers. The MSB of the fixed-point number indicates
the sign (referred to as the sign bit), whereas the rest of the bits define the
magnitude of the number. Figure 3.1 shows the structure of an N-bit signed
number in two's complement format as used in this implementation. The range of
numbers represented by an N-bit word is from −2^(N−1) to 2^(N−1) − 1. Two's
complement representation of numbers greatly simplifies the hardware
implementation of the arithmetic being performed.

Figure 3.1: Two's complement number representation (a sign bit followed by the
N − 1 magnitude/fraction bits)

Table 3.1 provides a few examples of 16 bit two's complement fixed-point
numbers in the 8:8 format.

Table 3.1: Two's complement fixed-point representation
Binary Two's Complement Decimal Equivalent
00010110.11000000 22.75
11101001.01000000 -22.75
00001000.00100000 8.125
11110111.11100000 -8.125
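As a concrete illustration of Table 3.1, the 8:8 encoding can be sketched in a few lines of Python. The helper names to_q88/from_q88 are assumptions for this example, not names from the thesis.

```python
# Encode/decode real numbers as 16-bit two's complement 8:8 words,
# matching the examples of Table 3.1.

def to_q88(value):
    """Encode a real number as a 16-bit two's complement 8:8 word."""
    raw = round(value * 256)               # scale by 2^8 (8 fractional bits)
    assert -2**15 <= raw < 2**15, "out of 8:8 range"
    return raw & 0xFFFF                    # two's complement wrap to 16 bits

def from_q88(word):
    """Decode a 16-bit 8:8 word back to a real number."""
    if word & 0x8000:                      # sign bit set: negative number
        word -= 0x10000
    return word / 256
```

For example, to_q88(22.75) gives the bit pattern 00010110.11000000 (0x16C0) from the first row of Table 3.1.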
3.1.2 Arithmetic Operations
This section provides examples of various operations performed on two's
complement fixed point numbers. Three basic operations of addition, subtraction and
multiplication are considered. Let us go through each of these operations one at a
time.
• Addition
Addition involves simple addition of bits when the number is represented in
two’s complement form. The two operands are sign extended from 16 bits to
17 bits and the 17th bit of the result of addition is then sign extended to obtain
the 32 bit output. The following examples illustrate the addition operation:
1. 22.75 + (-8.125) = 14.625
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are appended to the right to extend the fractional
part. Hence 00001110.10100000 in the 8:8 format is represented
as 0000000000001110.1010000000000000 in the 16:16 format.
2. (-8.125) + (-8.125) = (-16.25)
1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
1 1 1 1 0 1 1 1 1 . 1 1 0 0 0 0 0 0
The 17th bit, 1, is used for sign extension and 7 ones are added to
the left. 8 zeros are appended to the right to extend the fractional
part. Hence 11101111.11000000 in the 8:8 format is represented as
1111111111101111.1100000000000000 in the 16:16 format.
Thus, in terms of hardware, the 16 bit number needs to be sign extended to
17 bits and the 17th bit of the result needs to be used for sign extension.
• Subtraction
The rules followed in addition also apply when performing subtraction of
numbers in two's complement form. The only change is that we take the two's
complement of the number to be subtracted and then add it to the other
number, which remains in its two's complement form. The rest of the procedure
is unchanged. The following examples illustrate the subtraction operation:
1. 22.75 - 8.125 = 14.625
22.75 in two’s complement form is represented as 00010110.11000000.
8.125 in two’s complement form is represented as 00001000.00100000.
Its two’s complement is 11110111.11100000. Hence,
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are appended to the right to extend the fractional
part. This gives the desired result of the subtraction, 14.625.
2. 8.125 - 22.75 = (-14.625)
8.125 in two’s complement form is represented as 00001000.00100000.
22.75 in two’s complement form is represented as 00010110.11000000.
Its two’s complement is 11101001.01000000. Hence,
0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0
+ 1 1 1 1 0 1 0 0 1 . 0 1 0 0 0 0 0 0
1 1 1 1 1 0 0 0 1 . 0 1 1 0 0 0 0 0
The 17th bit, 1, is used for sign extension and 7 ones are added to
the left. 8 zeros are appended to the right to extend the fractional
part. This gives the desired result of the subtraction, -14.625.
3. 22.75 - (-8.125) = 30.625
22.75 in two’s complement form is represented as 00010110.11000000.
-8.125 in two’s complement form is represented as 11110111.11100000.
Its two’s complement is 00001000.00100000. Hence,
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0
0 0 0 0 1 1 1 1 0 . 1 1 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are appended to the right to extend the fractional
part. Thus, we obtain the desired result, 30.625.
• Multiplication
Multiplication of two N-bit numbers gives a 2N-bit result. Consequently, the
issue of sign extension does not arise in multiplication. Multiplication of two
16:16 numbers will result in a 32:32 number. In my examples, I consider
numbers in the 4:4 format. A couple of examples illustrate the multiplication
operation:
1. 1.25 ∗ 3.25 = 4.0625
0001.0100 ∗ 0011.0100 = 00000100.00010000
2. 7.9375 ∗ 7.9375 = 63.00390625
0111.1111 ∗ 0111.1111 = 00111111.00000001
If either multiplicand is a negative number, we first take the two's complement
of that number and then perform the usual multiplication as explained above.
The sign of the result depends on how many two's complements were taken
before performing the multiplication. In hardware, this is determined by the
exclusive-OR of the two sign bits.
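The worked examples above can be reproduced by operating directly on 16-bit words the way the hardware does. This is a behavioral Python sketch; the helper names, and the use of Python's arbitrary-precision integers masked to word width, are assumptions of the example.

```python
# Two's complement 8:8 arithmetic on raw 16-bit words.

MASK16 = 0xFFFF

def sext(word, bits=16):
    """Interpret an unsigned word as a signed two's complement value."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word

def q88_add(a, b):
    """16-bit 8:8 addition; masking models discarding the carry-out."""
    return (a + b) & MASK16

def q88_sub(a, b):
    """Subtraction = addition of the two's complement of b."""
    return (a + ((~b + 1) & MASK16)) & MASK16

def q88_mul(a, b):
    """Signed 8:8 x 8:8 multiply giving a 32-bit 16:16 product."""
    return (sext(a) * sext(b)) & 0xFFFFFFFF
```

For example, q88_add(0x16C0, 0xF7E0) yields 0x0EA0, i.e. 22.75 + (-8.125) = 14.625 as in addition example 1 above.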
Addition, subtraction and multiplication are three very important operations
performed by the I-ALU, and the functionality of all three has been described in
detail in this section. This study goes a long way in determining the hardware
architecture of the system. Besides these operations, the multiply-accumulate
forms the heart of any DSP processor; special emphasis is laid on it in the
following sections. Division by numbers other than powers of 2 occurs rarely in
DSP related applications, so division is implemented only by the shift operation,
given the time and area cost of a full divider. Section 3.2 addresses the important
issue of rounding.
3.2 Outward/Directed Rounding
For most systems, although the internal buses of an ALU may be wide, the
registers are of fixed size. The input to this system is 16 bits wide, while the
internal bus is 32 bits wide. The word length of the outputs of the functional
units must therefore be reduced before they can be stored in these smaller
registers. This reduction in word length is achieved by the rounding operation:
the bits of lower significance of the output are discarded according to the
rounding direction of the operand. This introduces errors called precision
rounding errors. However, interval arithmetic ensures that the exact result of
the operation lies within the output interval. Provision is made in this system to
round the output interval values from 32 bits to either 24 bits or 16 bits,
depending on an input signal. This provision keeps the design flexible for
different applications.
As discussed earlier, the proposed system is based on two's complement
fixed-point number representation, which greatly simplifies the rounding
algorithm: a single procedure rounds up both positive and negative numbers,
and a single, different procedure rounds down both positive and negative
numbers. The IEEE standard defines four rounding modes, viz. round to
nearest, round towards zero, round towards positive infinity and round towards
negative infinity. For interval arithmetic, we are concerned with two of these:
rounding towards positive infinity and rounding towards negative infinity. The
algorithms for both are explained below with suitable examples.
• Rounding towards negative infinity.
Rounding towards negative infinity refers to denoting a high precision number
by the greatest machine representable number of low precision but smaller in
value. In fixed-point two’s complement representation, this is achieved by
simply discarding the bits of lower significance. This algorithm holds true for
positive and negative numbers as illustrated by the following examples:
– Positive numbers
6.78125 in the 8:8 format is represented as 00000110.11001000.
In the 8:4 format, it is represented as 00000110.1100, which is 6.75.
– Negative numbers
-6.78125 in the 8:8 format is represented as 11111001.00111000.
In the 8:4 format, it is represented as 11111001.0011, which is -6.8125.
• Rounding towards positive infinity.
Rounding towards positive infinity refers to denoting a high precision number
by the smallest machine representable number of low precision but greater
in value. In fixed-point two’s complement representation, this is achieved by
performing a logical ‘OR’ on the bits of lower significance to be discarded and
then adding this bit to the number to be retained. Once again, this algorithm
holds true for positive and negative numbers as illustrated by the following
examples:
– Positive numbers
6.78125 in the 8:8 format is represented as 00000110.11001000.
In the 8:4 format, it is represented as 00000110.1101, which is 6.8125.
– Negative numbers
-6.78125 in the 8:8 format is represented as 11111001.00111000.
In the 8:4 format, it is represented as 11111001.0100, which is -6.75.
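The two directed-rounding algorithms above can be sketched as bit operations on the raw fixed-point words. The function names are illustrative; drop is the number of low-order bits discarded (4 in the 8:8 to 8:4 examples above).

```python
# Directed rounding on two's complement fixed-point words.

def round_down(word, drop):
    """Toward -infinity: simply discard the `drop` low-order bits."""
    return word >> drop

def round_up(word, drop):
    """Toward +infinity: OR the discarded bits into a sticky bit, then add it."""
    sticky = 1 if word & ((1 << drop) - 1) else 0
    return (word >> drop) + sticky
```

With word = 0x06C8 (6.78125 in 8:8), round_down gives 0x6C (6.75 in 8:4) and round_up gives 0x6D (6.8125), matching the examples above; the same code handles the negative patterns unchanged.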
The above procedure rounds 32 bit fixed-point numbers in the 16:16 format to
24 bit fixed-point numbers in the 16:8 format; the same procedure is followed if
the output has to be reduced from 32 bits to 16 bits. These examples cover all
aspects of the rounding algorithm. It is called the “Outward Rounding” or
“Directed Rounding” algorithm and is responsible for the validity of the results
provided by interval analysis. The study of this algorithm makes it simple to
design the hardware that implements outward rounding. The proposed design
includes a separate rounding unit which takes inputs from the functional units
and provides the outputs of the system.
After getting acquainted with the design specifications of the ALU, I now proceed
to give a comprehensive description of its architecture. A few different architectures
have been explored to arrive at an optimum solution.
Chapter 4
Hardware Architecture
This chapter of the thesis contains a description of all the modules that constitute
the Interval-ALU. It gives details of the logic design at the gate level for the whole
system, one module at a time. The hardware model at the RTL level of abstraction
is built from these logic designs. Since throughput is the main performance metric
to be optimized, the logic is designed with reduction of the critical path delay in
mind. Several pipelined versions of the design are built along with the basic non-
pipelined one to improve the throughput. I begin with the top level block diagram
of the design and then go into the details of each module.
4.1 Overall Architecture
The overall architecture of the ALU can be seen in the block diagram shown in
Figure 4.1. The hardware model is divided into four parts, viz. the flag generator,
lower bound and upper bound modules, and the rounding unit. The flag generator
module is responsible for generating the control signals for the more complicated
multiplication operation. As the name suggests, the lower bound module and the
upper bound module calculate the lower and upper bounds of the output interval,
respectively. These two modules are independent of each other and hence operate in
parallel. The rounding unit implements the Outward Rounding algorithm explained
earlier.
Figure 4.1: Top Level Block Diagram of the ALU
The ALU is designed for operation on 16 bit input interval numbers in the two’s
complement form. The ALU has an input line that allows selection of the multiply-
accumulate mode, acc select. The ALU operates in the accumulate mode as long
as this line is held high. Another input line, rctl, determines the number of output
bits. The output is rounded to 24 bits when this line is held high and 16 bits when
this line is held low. Table 4.1 lists all the inputs to the ALU.

Table 4.1: Description of ALU Inputs
Input Description Bit Width
xL Lower bound on left-hand operand 16 bits
xU Upper bound on left-hand operand 16 bits
yL Lower bound on right-hand operand 16 bits
yU Upper bound on right-hand operand 16 bits
command Mathematical operation to be performed 3 bits
acc select Perform MAC when asserted 1 bit
rctl Width of output results (16 or 24 bits) 1 bit

The ALU has two 24-bit output lines that represent the lower and upper bounds
of the resulting interval. Besides this, there are output lines to flag the special
case in multiplication, the union of two disjoint sets and the intersection of
two disjoint sets. A further explanation of these output lines is provided in the
following sections. Table 4.2 lists all the outputs of the ALU.
Table 4.2: Description of ALU outputs
Output Description Bit Width
zL Lower bound on result 24 bits
zU Upper bound on result 24 bits
next Valid results on output lines 1 bit
union Union of disjoint sets 1 bit
empty Intersection of disjoint sets 1 bit
4.2 Flag Generator Module
A major significance of the design is the reduction in the number of
multiplications performed to evaluate the result of an interval multiplication
operation. The flag generator module forms the control logic for the
multiplication operation. It generates the necessary control signals to determine
one of the nine cases in multiplication. Based on the values of the input
operands, it generates a 4 bit mul flag
which selects among the nine cases. Table 4.3 shows the case to be selected based
on the value on the mul flag.
Table 4.3: mul flag for the Multiplication operation
mul flag Case Result
0001 xL ≥ 0; yL ≥ 0 [xLyL,xUyU ]
0010 xL ≥ 0; yL < 0 < yU [xUyL,xUyU ]
0011 xL ≥ 0; yU ≤ 0 [xUyL,xLyU ]
0100 xL < 0 < xU ; yL ≥ 0 [xLyU ,xUyU ]
0101 xL < 0 < xU ; yU ≤ 0 [xUyL,xLyL]
0110 xU ≤ 0; yL ≥ 0 [xLyU ,xUyL]
0111 xU ≤ 0; yL < 0 < yU [xLyU ,xLyL]
1000 xU ≤ 0; yU ≤ 0 [xUyU ,xLyL]
0000 xL < 0 < xU ;yL < 0 < yU [min(xUyL,xLyU),
max(xLyL,xUyU)]
The logic behind the generation of this flag is shown in Figure 4.2. Table 4.4
explains the generation of the various flag signals used in Figure 4.2. As shown in
the table, the flag signals are generated based on the values of the input operands.
These flags are used to obtain the mul signal based on which the inputs to the
multipliers in the main functional units are selected. This reduces the number of
multiplications to be performed.
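The flag logic of Table 4.3 and Table 4.4 can be modeled behaviorally as follows. This is an illustrative Python sketch; the signal names follow the tables, but the function itself is not the thesis RTL.

```python
# Model of the flag generator: sign tests on the operand bounds
# (Table 4.4) select one of the nine multiplication cases (Table 4.3).

def mul_flag(xL, xU, yL, yU):
    """Return the 4-bit case selector; 0b0000 marks the special case 9."""
    f1, f2 = xL >= 0, yL >= 0          # flag 1, flag 2
    f3, f4 = xU <= 0, yU <= 0          # flag 3, flag 4
    f5 = xL < 0 < xU                   # flag 5: x straddles zero
    f6 = yL < 0 < yU                   # flag 6: y straddles zero
    if f1 and f2: return 0b0001
    if f1 and f6: return 0b0010
    if f1 and f4: return 0b0011
    if f5 and f2: return 0b0100
    if f5 and f4: return 0b0101
    if f3 and f2: return 0b0110
    if f3 and f6: return 0b0111
    if f3 and f4: return 0b1000
    return 0b0000                      # case 9: both intervals contain zero
```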
Table 4.4: Flag Generation
Condition Flag Generated
xL ≥ 0 flag 1
yL ≥ 0 flag 2
xU ≤ 0 flag 3
yU ≤ 0 flag 4
(xL < 0)&&(xU > 0) flag 5
(yL < 0)&&(yU > 0) flag 6

Figure 4.2: Flag Generation Module

4.3 Lower Bound and Upper Bound Modules
The lower bound and the upper bound modules have very similar hardware
structures. The primary difference is the inputs that drive their functional units.
These modules form the heart of the ALU, as most of the logic involved in the
design is concentrated in them. Both modules are independent of each other's
operation, which makes the working of the ALU dependent only on the input
signals and not on its internal hardware. Each module is characterized by
dedicated hardware to
perform the accumulate operation. This is an important feature of the ALU because
the dot product, which is implemented through the multiply-accumulate operation,
is a core requirement of any DSP processor. Figure 4.3 and Figure 4.4 display the
basic block diagram of each of the two modules. As seen in the block diagrams, the
two modules are very similar in architecture. However, it is important to note the
different status lines generated by different portions of the modules.
Figure 4.3: Block Diagram of Lower Bound Module
Figure 4.4: Block Diagram of Upper Bound Module
In a non-pipelined architecture, the circuit performance is determined by these
modules of the design. The critical path of a design may be defined as the single
slowest feasible path contained in the design. The greater the logic depth, the
longer the critical path, the lower the frequency at which the circuit can operate,
and thus the lower the throughput. Since these modules form the critical path,
significant effort must be put into optimizing them. Pipelining provides the ideal
solution to the throughput problem and is discussed in detail beginning in section
4.5. I will now go through each of the individual blocks that make up the ALU as
shown in Figure 4.1.
4.3.1 Functional Units and Control Logic
The combinational logic required to perform arithmetic operations on interval
numbers is located in the functional unit block. Apart from the difference in a few
output status lines, the hardware of the functional units in the two modules is
identical. Each module has an adder/subtractor, a multiplier and other
combinational logic to implement the set operations. Figure 4.5 shows the
functional unit in the lower bound module. It is important to note that the inputs
to the adder are the two lower bounds of the input operands, while the inputs to
the subtractor are the lower bound of the first input operand and the upper bound
of the second. The type of operation and the mul signal are used as controls to
determine the outputs of the functional units. The status line empty is generated
by this portion of the design to indicate the intersection of two disjoint sets.
Figure 4.6 shows the functional unit in the upper bound module. The inputs
to the adder and subtractor are different for this module than the lower bound
module. The upper bounds of both the input operands drive the adder, while the
upper bound of the first input operand and the lower bound of the second input
operand drive the subtractor. Once again, the mul flag determines the inputs to the
multiplier and the outputs of the union and intersection set operations. The status
line union is generated by this module which indicates the union of two disjoint
sets.
The outputs of the arithmetic units are given to the special case multiplication
block. The next section describes the details of the special case multiplication block.
Figure 4.5: Lower Bound Module
4.3.2 Special Case Multiplication Block
The situation in which both input operands include zero in their intervals
represents a special case, referred to as the ‘Special Case Multiplication’. In
contrast to the normal cases, where interval multiplication requires only two
multiplications, this special case also requires a comparison between two products
for each of the bounds. Hence a memory element is needed to store one product
and make it available for comparison with the next. It requires two
multiplications and one comparison to obtain each
the two bounds. The hardware to determine the lower bound and the upper bound
is identical and is repeated in both the modules. A status line next is taken from this
block in the lower bound module to indicate the special case. Figure 4.7 shows the
hardware architecture of this block. As seen in the diagram, the left/right out c
line coming in from the functional unit block is stored as left/right out r and used
for comparison. The result of this comparison is used only for the interval multi-
plication operation when the mul flag is 0000. The minimum value is selected for
the lower bound module and the maximum value is selected for the upper bound
module. The special case multiplication may lead to synchronization problems if
not dealt with properly.
Figure 4.7: Special Case Multiplication
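A behavioral sketch of this special-case sequencing: for each bound the first product is registered, the second is computed, and a comparison selects the minimum (lower bound module) or maximum (upper bound module). The names are illustrative, not the thesis signal names, and the two register-then-compare steps here stand in for the two hardware cycles.

```python
# Special-case multiplication (mul flag == 0000): two multiplications and
# one comparison per bound, with one product held in a register.

def special_case_bounds(xL, xU, yL, yU):
    out_r = xU * yL               # step 1: first product stored in the register
    out_c = xL * yU               # step 2: second product arrives
    lower = min(out_r, out_c)     # lower bound module keeps the minimum
    out_r = xL * yL
    out_c = xU * yU
    upper = max(out_r, out_c)     # upper bound module keeps the maximum
    return lower, upper
```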
4.3.3 Multiply-Accumulate Block
A multiply-accumulator forms a very important part of the ALU, all the more
because it is intended for use in DSP applications. DSP applications are
characterized by repetitive multiply-add operations that are computed by means
of a dot product.
Mathematically, the dot product can be calculated as:
a · b = ∑ aibi (summing over i = 0 to n)
This operation can be readily performed by a multiply accumulate block. Figure
4.8 shows the hardware architecture of the multiply-accumulate block. As we can
see in the figure, it consists of an adder and a memory element which acts as the
accumulator. An input line acc select determines whether the output needs to be
accumulated or not. When high, the block is in accumulate mode. The output of
this block is 32 bits long and is given to the rounding unit, where outward rounding
is implemented.
Figure 4.8: Multiply-Accumulate Module
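Behaviorally, holding acc select high while streaming coefficient pairs through the MAC block computes the dot product, as the following sketch (in Python, with illustrative names) shows:

```python
# Dot product via repeated multiply-accumulate, one MAC per pair.

def dot_product(a, b):
    acc = 0                      # the accumulator register starts cleared
    for ai, bi in zip(a, b):
        acc += ai * bi           # one multiply-accumulate per coefficient pair
    return acc
```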
4.4 Rounding Unit
The rounding unit forms a critical part of the Interval-ALU. This unit performs
the outward rounding, which guarantees the result of a computation to lie within
the output interval. The proposed design has provision to round a 32 bit output
from the previous block to a 24 bit or a 16 bit word depending on the application
for which it is going to be used. An input line, rctl, determines the number of bits to
which the output has to be rounded. When this line is high, the output is rounded
to 24 bits, else it is rounded to 16 bits. The outward rounding algorithm has been
discussed in section 3.2, hence this section concentrates on describing the hardware
to implement these rounding modes. Figure 4.9 shows the architecture for rounding
the output of the lower bound module.
Figure 4.9: Lower Bound Rounding
From the figure we can see that 8 or 16 bits of lower significance are discarded
based on the rctl input, and the higher 24 bits or 16 bits are retained.
The rounding operation for the upper bound module is slightly more complicated
as compared to the lower bound module. In this case, the bits of lower significance
are not simply discarded, but are logically ‘OR’ed and the resultant bit is added to
the bits to be retained. If the rctl line is high, the last 8 bits are logically ‘OR’ed
and added to the remaining 24 bits. On the other hand, if the rctl line is low, the
last 16 bits are logically ‘OR’ed and added to the 16 bits of high significance. Figure
4.10 shows the architecture for rounding the output of the upper bound module.
Figure 4.10: Upper Bound Rounding
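Putting the two rounding directions together with the rctl control gives a behavioral sketch of the rounding unit; the function name and the use of unsigned Python integers are assumptions of this example.

```python
# Rounding unit model: a 32-bit result is reduced to 24 bits (rctl high)
# or 16 bits (rctl low); the lower bound truncates (round toward -inf),
# the upper bound adds a sticky OR of the discarded bits (round toward +inf).

def round_unit(lower32, upper32, rctl):
    drop = 8 if rctl else 16                      # rctl=1: keep 24 bits
    z_lo = lower32 >> drop                        # discard low bits: round down
    sticky = 1 if upper32 & ((1 << drop) - 1) else 0
    z_hi = (upper32 >> drop) + sticky             # add sticky bit: round up
    return z_lo, z_hi
```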
This completes the architecture of the entire ALU. From the architecture, it is
safe to say that the lower bound module and the upper bound module are the
critical modules in the design. Most of the logic is concentrated in them, and
hence it is important to optimize them to a high degree. The following section
concentrates on optimizing these critical modules to improve the throughput of
the system by pipelining the design. It provides details of several pipelined
versions of the I-ALU.
4.5 Pipeline Architecture of the Design
This section presents several versions of the I-ALU, pipelined to various degrees to achieve maximum throughput. Pipelining is a technique used to reduce the critical path of a circuit and hence increase the speed at which it can operate. The increase in throughput comes at the cost of increased area, increased power dissipation and increased initial latency. Although pipelining may appear to have more disadvantages than advantages, in DSP systems, where throughput is of prime importance, it goes a long way toward improving the efficiency of the system. As seen earlier, DSP workloads are dominated by multiplication and addition instructions. A multiplier built from combinational logic alone has a very high logic depth and is one of the main candidates for the critical path. Hence, to reduce this logic depth, it is necessary to pipeline the multiplier. However, the benefit of pipelining the multiplier saturates at a certain depth, beyond which the rest of the design must be pipelined to improve efficiency further. The following sections discuss these pipelining techniques in more detail.
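The throughput/latency trade-off described above can be illustrated with a small model. A hedged Python sketch (the stage delays below are made-up numbers, not synthesis results): the clock period of a pipelined circuit is set by its slowest stage, and a run of n operations pays the pipeline-fill latency only once.

```python
def pipeline_stats(stage_delays_ns, n_ops):
    """Return (clock period, total time) for n_ops back-to-back operations.

    The period is the slowest stage's delay; after the pipeline fills
    (len(stage_delays_ns) cycles), one result completes per cycle."""
    period = max(stage_delays_ns)
    cycles = len(stage_delays_ns) + n_ops - 1
    return period, cycles * period

# One monolithic 16 ns stage vs. the same logic split into four 4 ns stages:
flat = pipeline_stats([16.0], 1000)       # (16.0, 16000.0)
split = pipeline_stats([4.0] * 4, 1000)   # (4.0, 4012.0)
```

For long runs the pipelined version approaches a 4x throughput gain, while paying only a three-cycle extra initial latency.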
4.5.1 Need for Pipelining
In the proposed I-ALU design, synthesis results have shown that without any
level of pipelining, the lower bound module and the upper bound module form the
critical path in the design. Figure 4.11 shows the critical path in these modules.
The diagram shows that the critical path traverses some control logic, a multiplier, some more combinational logic and finally an adder. This long path forces the design to operate at lower clock frequencies. The bulk of the logic in the critical path belongs to the multiplier, so pipelining this portion of the logic yields a significant decrease in the clock period. The implementation of several pipelined multiplier architectures is reported in the following section.
Figure 4.11: Critical Path
4.5.2 Partially Pipelined Design
A partially pipelined architecture replaces the purely combinational multiplier with a pipelined multiplier architecture. The design tool Synopsys provides several pipelined multiplier architectures in its library that can be used [33]. A significant increase in circuit performance is observed from the use of these DesignWare IP blocks. However, this improvement comes at the cost of increased area and power dissipation, so a suitable trade-off must be made to choose the best design. Figure 4.12 shows the architecture of a non-pipelined multiplier, while Figures 4.13, 4.14, 4.15 and 4.16 show abstract architectures of two-, three-, four- and five-stage pipelined multipliers, respectively.
As seen in Figure 4.12, the cloud of combinational logic in a non-pipelined multiplier is large. Pipelining introduces registers into this combinational logic, thereby reducing the critical path.
Figure 4.12: Non-Pipelined Multiplier Architecture
Figure 4.13: Two-stage Pipelined Multiplier Architecture
Figure 4.14: Three-stage Pipelined Multiplier Architecture
The subsequent figures show that as the number of pipeline stages increases, the cloud of logic between two consecutive registers shrinks, so each pipelined multiplier operates at a faster clock than the previous one. However, once such a pipelined multiplier is included as part of the circuit, beyond a certain level of pipelining the multiplier ceases to be part of the critical path. Instead, the other combinational logic, involving adders and multiplexers, forms
the critical path.
Figure 4.15: Four-stage Pipelined Multiplier Architecture
Figure 4.16: Five-stage Pipelined Multiplier Architecture
Thus, a pipelined multiplier contributes significantly to improving the throughput of a system, but further performance gains require pipelining the rest of the design, as discussed in the following section.
4.5.3 Highly Pipelined Design
A highly pipelined design combines the pipelined multiplier that gives the best results for this design with additional pipeline stages in the surrounding logic. As the results in Chapter 5 indicate, the performance of the ALU saturates for pipelined multipliers with more than three stages: all architectures employing deeper multipliers operate at approximately the same clock frequency. This is because, beyond three stages, the multiplier ceases to be part of the critical path; the maximum logic depth is instead formed by other combinational logic involving multiplexers and adders, as shown in Figure 4.11. For further performance enhancement it is necessary to reduce this cloud of combinational logic. Improvement in circuit performance is therefore achieved by splitting this cloud into smaller clouds, apart
from using a three stage pipelined multiplier. Figure 4.17 shows the logic diagram
for this highly optimized design.
Figure 4.17: Highly Pipelined Architecture
The figure shows that several registers are inserted in the critical path to reduce its length. The combinational logic between any two consecutive registers is reduced to some control logic or simply an adder. This produces a significant decrease in the overall clock period of the circuit, as the synthesis results will show. Among the combinational blocks used in this design, the multipliers contain the most logic, followed by the adders and then the control logic. It is therefore important to introduce a register in the path prior to the adder: this brings the clock period down from 17.73 ns to 3.55 ns. Introducing two more pipeline stages in the control logic further decreases the clock period to 3.25 ns, as shall be seen in the following section. This improvement in performance from 3.55 ns to 3.25 ns, a decrease of roughly 8.5%, comes at a 4.97% increase in the area of the design.
In conclusion, this chapter forms the heart of the work presented in this thesis. It provides a detailed explanation of the architecture of the overall design: every module and its functionality has been explained comprehensively, and one of the most important circuit optimizations, pipelining, has been discussed. Along the way, the role played by the multipliers in governing the performance of the ALU has been highlighted, and a highly pipelined architecture employing a three-stage multiplier will be shown, on the basis of synthesis results, to be the best design. The next chapter provides statistical results obtained from various simulation and synthesis runs; throughput, area and power dissipation have been obtained for all previously described architectures.
Chapter 5
Testing and Results
This chapter compares the timing, area and power dissipation of the different I-ALU architectures based on statistical data obtained from simulation and synthesis. Given its significance for DSP applications, throughput is treated as the prime performance metric when analyzing the various designs and arriving at an optimum solution. Efforts to improve the throughput of the system have an adverse effect on the area, latency and power dissipation of each design. Tables and graphs are used to compare these metrics across the various architectures.
The functionality of the design was verified by running simulations in the Cadence
environment. Verilog HDL was used to capture the behavior of the ALU while
Synopsys was used for synthesis purposes. The 0.18 µm ‘OSU Standard Cell Library’
was used while synthesizing the various modules. Synopsys Design Compiler [34]
was used for timing analysis and Synopsys PrimePower [35] was used to obtain the
power dissipation results.
5.1 Simulation Results
The operation of the design was verified by running simulations for 100% code coverage. All possible input combinations were considered and the simulation results were compared with the expected values. Three cases deserve special note: the union of two disjoint sets requires two clock cycles, the intersection of two disjoint sets results in a null set, and the special-case multiplication requires two clock cycles, as opposed to the single clock cycle required for all other operations. Figure 5.1 shows the results for the addition, subtraction and multiplication operations. As seen in the figure, the results of addition, subtraction and multiplication were each obtained after one clock cycle, i.e. the design has a latency of 1. For the special-case multiplication, however, the output was obtained only after two clock cycles; the status line then goes high to indicate the occurrence of this case, signaling that the actual output is available on the following clock cycle rather than the current one. Also, as soon as the acc select line goes high, the ALU enters accumulate mode, as can be seen from the simulation results.
Figure 5.2 shows the simulation results for the interval union and intersection
operations. Apart from the usual behavior of union and intersection, the two special
cases are worth noting. The union of two disjoint sets results in two output intervals
on two consecutive clock cycles, and this is appropriately indicated by the union
status line. For the case of intersection of two disjoint sets, the empty status line
goes high indicating a null set, in which case the interval output is considered invalid.
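The union and intersection behavior just described can be sketched in Python. This is a software model of the simulated semantics only, not of the hardware itself: the Boolean flag models the empty status line, and a two-interval result models the two consecutive output cycles.

```python
def interval_intersect(a, b):
    """Intersection of closed intervals a=(lo, hi), b=(lo, hi).
    Returns (result, empty): 'empty' mirrors the empty status line."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        return None, True              # disjoint operands: null set
    return (lo, hi), False

def interval_union(a, b):
    """Union of two intervals. Disjoint operands yield two output
    intervals, which the hardware emits on consecutive clock cycles."""
    if a[1] < b[0] or b[1] < a[0]:
        return sorted([a, b])          # two disjoint pieces
    return [(min(a[0], b[0]), max(a[1], b[1]))]
```

For example, intersecting (0, 2) with (5, 7) returns the empty flag, while their union returns both operand intervals.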
5.2 Synthesis Results
The different architectures described in the previous chapter have been synthe-
sized. From the synthesis results, a comparative analysis in terms of throughput,
Figure 5.1: Simulation Results for Add, Subtract and Multiply
Figure 5.2: Simulation Results for Interval Union and Intersection
area and power dissipation has been performed and is presented in this section as follows. Sections 5.2.1 to 5.2.3 contain the timing and area results for the various architectures: Section 5.2.1 gives the results for the most primitive design, the non-pipelined version; Section 5.2.2 gives comparative results for architectures using pipelined multipliers of increasing depth; and Section 5.2.3 presents the results for the highly pipelined architecture, which is the most throughput-optimized design among those discussed. While performing these tests on the overall designs, each individual lower-level module, viz. the flag generator, the lower bound module, the upper bound module and the rounding unit, was optimized for best performance. These optimized lower-level modules were then used to obtain the results for the top-level module, ensuring that the results obtained are the best possible. Section 5.3 gives the power dissipation results.
5.2.1 Non-Pipelined Architecture
The timing results of the optimized modules for the non-pipelined architecture, as reported by Synopsys, are tabulated in Table 5.1. A graphical representation of these results, shown in Figure 5.3, gives a better view of the distribution of logic across the various modules. The obtained results are consistent with expectations.
Table 5.1: Timing Reports for the Non-Pipelined Architecture
No. Module Name Timing Report
1 Flag Generator Module 3.30 ns
2 Lower Bound Module 17.79 ns
3 Upper Bound Module 17.29 ns
4 Lower Bound Rounding 2.71 ns
5 Upper Bound Rounding 3.42 ns
6 Overall Architecture 17.73 ns
As specified in the previous section, the critical path of the design lies in either the lower bound or the upper bound module, as these results aptly show. The two modules have the highest concentration of combinational logic and hence determine the timing of the overall circuit. The other modules contain fewer combinational blocks and can therefore operate at higher speeds. The values listed in the table are as reported by Synopsys. The overall design achieves a lower clock period than the slowest lower-level modules because the tool re-optimizes the timing of all the modules while synthesizing the top-level module. Figure 5.3 is a graphical representation of these results.
Figure 5.3: Timing Reports of Non-Pipelined Architecture
Table 5.2 gives the area reports of all the lower level and upper level modules.
Figure 5.4 is a graphical representation of these results. As expected, most of the
area of the design is formed by the lower and upper bound modules. The area of
the overall design is approximately the sum of the areas of the individual modules.
Table 5.2: Area Reports of Non-Pipelined Architecture
No. Module Name Area Report
1 Flag Generator Module 8,964 µm2
2 Lower Bound Module 116,502 µm2
3 Upper Bound Module 110,855 µm2
4 Lower Bound Rounding 6,096 µm2
5 Upper Bound Rounding 11,059 µm2
6 Overall Architecture 253,476 µm2
From the results in Table 5.1, the minimum clock period required for the operation of the circuit is 17.73 ns; thus, the maximum frequency at which it can operate is 56.401 MHz. This is a considerably low frequency of operation and hence represents a design with low throughput. Design effort therefore needs to be directed toward improving the throughput of the non-pipelined design, which is achieved with a pipelined architecture. The following sections show the results for the various pipelined architectures. It is important to note that the area of this design is 253,476 µm2 and that the efforts to improve the circuit will have a direct impact on it: the area of the design increases as the number of pipeline stages increases.
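The frequency figure quoted above is simply the reciprocal of the minimum clock period; a one-line Python check:

```python
def max_frequency_mhz(period_ns):
    """Maximum operating frequency (MHz) for a minimum clock period (ns)."""
    return 1000.0 / period_ns

f = max_frequency_mhz(17.73)   # about 56.4 MHz, as reported
```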
5.2.2 Design with Pipeline Multipliers
As described in the previous chapter, the purpose of using pipelined multipliers is to reduce the critical path of the design. The critical path shortens as the number of multiplier stages increases, as is evident from the results tabulated in Table 5.3, which shows the minimum clock period for the various pipelined multipliers. Since the multiplier is part of the critical path of the entire design, the use of pipelined multipliers directly increases the
Figure 5.4: Area Reports for the Non-Pipelined Architecture
frequency at which the circuit can operate.
Increasing the number of stages in the pipelined multipliers certainly improves throughput. The multiplier is present in the lower bound module and the upper bound module, and synthesis runs on these modules clearly show the decrease in their minimum clock period. The flag generator module, the lower bound rounding module and the upper bound rounding module are unaffected by the multiplier's pipeline depth. Table 5.4 shows the timing reports for the two significant modules and the effect of the pipelined multipliers on the overall design.
Table 5.3: Timing Reports for the Pipelined Multipliers
Pipeline Multiplier Minimum Clock Period
Two Stage 6.52 ns
Three Stage 3.38 ns
Four Stage 2.38 ns
Five Stage 2.38 ns
Table 5.4: Timing Reports for various Pipelined Architectures
Pipeline Stage Lower Module Upper Module Overall Design
Non-Pipelined 17.79 ns 17.29 ns 17.73 ns
Two Stage 7.75 ns 7.51 ns 7.75 ns
Three Stage 4.75 ns 4.67 ns 4.75 ns
Four Stage 4.67 ns 4.57 ns 4.67 ns
Five Stage 4.78 ns 4.60 ns 4.78 ns
Figure 5.5 plots the minimum clock period of the different overall designs. From the graph, a significant improvement in timing is observed up to the three-stage pipelined multiplier: the clock period is cut by more than half in going from the non-pipelined design to a two-stage pipelined multiplier, and a further significant decrease is observed with a three-stage pipelined multiplier. However, increasing the pipeline depth beyond three stages does not change the timing of the overall design significantly, because the multiplier then ceases to be part of the critical path, which is instead formed by other combinational logic in the design.
The area of the design grows with the number of pipeline stages. Hence, following the law of diminishing returns, the design with a three-stage pipelined multiplier is the best among those described. Table 5.5 gives the area reports of the lower bound module, upper bound module and the overall architecture for these designs, and Figure 5.6 plots the areas of the different architectures.
Table 5.5: Area Reports for various Pipelined Architectures
Pipeline Stage Lower Module Upper Module Overall Design
Non-Pipelined 116,502 µm2 110,855 µm2 253,476 µm2
Two Stage 172,918 µm2 173,370 µm2 372,407 µm2
Three Stage 211,968 µm2 212,029 µm2 450,116 µm2
Four Stage 248,847 µm2 246,860 µm2 521,826 µm2
Five Stage 278,246 µm2 278,897 µm2 583,262 µm2
From the timing and area analysis, it is clear that a trade-off between the two performance metrics, delay and area, is needed to determine the better design. From the timing plot, we conclude that the improvement
Figure 5.5: Timing Reports of Different Pipelined Architectures
in performance saturates beyond a three-stage pipelined multiplier, while every additional stage significantly increases the area of the design. Figure 5.7 plots area and timing on the same graph for the various designs; from this plot, we can firmly say that a three-stage pipelined multiplier is the best choice for the proposed design.
Figure 5.6: Area Reports of Different Pipelined Architectures
5.2.3 Highly Pipelined Design
The architecture of the highly pipelined design was shown in the previous chapter: it uses a three-stage pipelined multiplier, and the remaining combinational logic is pipelined as deeply as possible to achieve maximum performance. Table 5.6 compares the timing of the highly pipelined design to the non-pipelined design; a significant improvement is observed in the main performance metric. Figure 5.8 is a graphical representation of the values in Table 5.6. An improvement on the order of 500% in the performance of the pipelined architecture over the non-pipelined architecture is observed. Considering the importance of throughput for DSP applications, this is a favorable trade-off when compared to the increase in the area of the design. Table 5.7 shows the comparison between the
Figure 5.7: Timing and Area Reports of Different Pipelined Architectures
frequency of operation, the upper bound on throughput and the initial latency of the non-pipelined and highly pipelined designs.
Finally, Table 5.8 compares the areas of the individual modules in the non-pipelined and highly pipelined designs. There is a considerable increase in area for the highly pipelined design; however, since area is a secondary performance metric for this design, the improvement in throughput outweighs the disadvantage of the increased area. Figure 5.9 is a graphical representation of the values in Table 5.8.
Table 5.6: Timing Reports for Non-Pipelined and Highly-Pipelined Architectures
Module Name Non-Pipelined Highly Pipelined % Improvement
Flag Generator 3.30 ns 3.30 ns 0.00
Lower Bound 17.79 ns 3.25 ns 547.38
Upper Bound 17.29 ns 3.25 ns 532.00
Lower Round 2.71 ns 2.71 ns 0.00
Upper Round 3.42 ns 3.42 ns 0.00
Overall Design 17.73 ns 3.25 ns 545.54
Table 5.7: Results for Non-Pipelined and Highly-Pipelined Architectures
Design Frequency Throughput (ops/s) Latency (cycles)
Non-Pipelined 56.401 MHz 56 × 10^6 2
Highly-Pipelined 307.692 MHz 307 × 10^6 7
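The frequency, throughput and latency figures in Table 5.7 can be combined into a total-time estimate. A hedged Python sketch (assuming one result per cycle once the pipeline has filled; the 10,000-operation batch size is an arbitrary example, not from the thesis):

```python
def batch_time_us(freq_mhz, latency_cycles, n_ops):
    """Time in microseconds to complete n_ops operations: the first result
    appears after latency_cycles, then one result completes per cycle."""
    cycles = latency_cycles + n_ops - 1
    return cycles / freq_mhz

slow = batch_time_us(56.401, 2, 10_000)    # non-pipelined design
fast = batch_time_us(307.692, 7, 10_000)   # highly pipelined design
speedup = slow / fast                      # roughly 5.45x, matching the ~545% figure
```

The extra five cycles of initial latency in the highly pipelined design are negligible for any realistic batch size.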
Table 5.8: Area Reports for Non-Pipelined and Highly-Pipelined Architectures
Module Name Non-Pipelined Highly Pipelined % Increase in Area
Flag Generator 8964 µm2 8964 µm2 0.00
Lower Bound 116502 µm2 262349 µm2 125.18
Upper Bound 110855 µm2 260793 µm2 135.25
Lower Round 6096 µm2 6096 µm2 0.00
Upper Round 11059 µm2 11059 µm2 0.00
Overall Design 253476 µm2 548899 µm2 116.54
Figure 5.8: Timing Reports of Non-Pipelined and Highly-Pipelined Architectures
5.3 Power Analysis
According to Moore's Law, the number of transistors on a chip doubles approximately every 18 months. Advances in technology are putting more transistors on a wafer and increasing wafer sizes while the cost per wafer remains approximately the same. Amidst these improvements, the limiting factor has been the heating of chips due to excessive power dissipation. Logic design efforts can improve the throughput of a system, while newer technologies help reduce its overall area; in these circumstances, power dissipation becomes a critical performance metric.
Figure 5.9: Area Reports of Non-Pipelined and Highly-Pipelined Architectures
The power analysis was done using the Synopsys power tool PrimePower, a full-chip, dynamic power analysis tool that works at the gate level. Its high-capacity power analysis supports industry-standard synthesis libraries and provides the detail needed to meet power specifications while reducing packaging costs and excess power consumption. It is an improvement over Synopsys' earlier power tool, Design Power; one of its advantages is the ability to handle designs of up to 10 million gates. PrimePower models pattern-dependent capacitive-switching, short-circuit and static power consumption, and accounts for instance-specific cell-state dependencies, glitches, multiple loads and nonlinear ramp effects. To use PrimePower, the design was first synthesized to generate a netlist. Simulations were then run on this synthesized design with
more than 1000 input vectors. The resulting .vcd file contains information about the switching activity caused by the input vectors; PrimePower uses this information to estimate the power dissipation. PrimePower outputs text reports giving a breakdown of power dissipation, as well as graphical reports that can include histograms and waveforms.
This section provides the results of the PrimePower analysis performed on the different ALU architectures with varying numbers of input vectors.
5.3.1 Generating Input Vectors
The power tool flow was automated using the SSHAFT scripts. For the specified input vectors, the script determines the number of 1 → 0 and 0 → 1 transitions in order to approximate the power dissipation. It is therefore important to supply as many input vectors as possible to obtain an acceptable approximation. These input vectors were randomly generated using the MATLAB function rand; the following MATLAB code generates them:
close all;
clc;
ra = round((2^16 - 1) * (rand(1,500) - 0.5));
rb = ra + 2;
rc = round((2^16 - 1) * (rand(1,500) - 0.5));
rd = rc + 2;
The rand function generates 500 values uniformly distributed between 0 and 1 for each of ra and rc. These values are shifted to lie between -0.5 and 0.5 and scaled to a 16-bit representation. [ra, rb] and [rc, rd] form the two input intervals; the upper bounds are derived from the lower bounds as shown above, since each must be greater than (or equal to) its respective lower bound.
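An equivalent vector generator can be sketched in Python using only the standard library (the seed and helper name here are arbitrary, introduced only for illustration):

```python
import random

def gen_vectors(n=500, seed=1):
    """Generate n interval operand pairs [ra, rb] and [rc, rd], mirroring
    the MATLAB script: shift rand into (-0.5, 0.5), scale to 16 bits."""
    rng = random.Random(seed)
    ra = [round((2**16 - 1) * (rng.random() - 0.5)) for _ in range(n)]
    rc = [round((2**16 - 1) * (rng.random() - 0.5)) for _ in range(n)]
    rb = [x + 2 for x in ra]   # each upper bound exceeds its lower bound
    rd = [x + 2 for x in rc]
    return ra, rb, rc, rd
```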
Further processing was done on the input vectors so that the special-case multiplication and the union of two disjoint sets were exercised. The number of random vectors generated can be varied by simply changing the arguments of the rand function. The power dissipation of all designs was analyzed using 500, 1000 and 2000 vectors; the results are given in the following section.
5.3.2 Statistical Results from Power Scripts
The power dissipation is expected to increase with the number of pipeline stages, because the switching activity grows with the number of registers. Table 5.9 gives the power dissipated, in mW, for the different architectures with 500 input vectors, and Figure 5.10 is a graphical representation of these values. As the plot shows, power dissipation increases with the number of pipeline stages.
Table 5.9: Power Dissipation for Different Architectures with 500 Input Vectors
Pipeline Multiplier Power Dissipated (in mW)
Non-pipelined 2.655 × 10^-2
Two Stage 2.961 × 10^-2
Three Stage 3.443 × 10^-2
Four Stage 3.881 × 10^-2
Five Stage 4.327 × 10^-2
The power dissipation is also expected to increase with the number of input vectors. Table 5.10 shows the power dissipation of the architecture that employs a 3-stage pipelined multiplier, and Figure 5.11 is a graphical representation in which the expected trend can be seen.
Figure 5.10: Power Dissipation for 500 Input Vectors
Lastly, Table 5.11 gives the power dissipation values for all the architectures, and Figure 5.12 is a graphical representation of the same. The values in the table show that power dissipation increases with the number of pipeline stages for every input-vector count, a trend that is consistent throughout the table. The highly pipelined architecture, which employs a 3-stage pipelined multiplier, shows higher power dissipation than the partially pipelined architecture that uses the same multiplier.
Having done extensive analysis of the timing, area and power dissipation of the architectures, the ideal way to compare them and arrive at the best solution is a performance metric that combines
Table 5.10: Power Dissipation for 3-stage Pipelined Architecture
Number of Vectors Power Dissipation (in mW)
500 3.443 × 10^-2
1000 3.516 × 10^-2
2000 3.551 × 10^-2
Figure 5.11: Power Dissipation for Different Input Vectors for the 3-stage Pipelined Multiplier Design
the effect of all of the above. Throughput per unit power dissipation is one of the best ways to summarize the performance of an architecture; with power analysis gaining importance, this metric has become an industry
Table 5.11: Power Dissipation (in mW) for All Stages with Different Input Vectors
Pipeline Stages 500 Vectors 1000 Vectors 2000 Vectors
Non-pipelined 2.655 × 10^-2 2.768 × 10^-2 2.785 × 10^-2
Two Stage 2.961 × 10^-2 3.219 × 10^-2 3.287 × 10^-2
Three Stage 3.443 × 10^-2 3.516 × 10^-2 3.551 × 10^-2
Four Stage 3.881 × 10^-2 3.957 × 10^-2 3.994 × 10^-2
Five Stage 4.327 × 10^-2 4.402 × 10^-2 4.441 × 10^-2
Highly Pipelined 4.207 × 10^-2 4.312 × 10^-2 4.295 × 10^-2
standard [36]. Table 5.12 lists these values for all the architectures, and Figure 5.13 is a graphical representation of the same. From the figure, it is clear that the highly pipelined design is by far the best architecture of all.
Table 5.12: Throughput per unit Power Dissipation for All Architectures
Architecture Throughput per Power (operations/s/mW)
Non-pipelined 20.231 × 10^8
Two-stage 40.07 × 10^8
Three-stage 59.727 × 10^8
Four-stage 54.081 × 10^8
Five-stage 47.478 × 10^8
Highly Pipelined 71.196 × 10^8
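The metric in Table 5.12 can be approximately reproduced from the earlier tables: take the throughput upper bound (one operation per cycle at the maximum clock frequency) and divide by the power dissipation in milliwatts. A sketch using the non-pipelined design's numbers:

```python
def throughput_per_mw(freq_mhz, power_mw):
    """Operations per second (one per cycle) per milliwatt of dissipation."""
    return freq_mhz * 1e6 / power_mw

# Non-pipelined design: ~56.4 MHz (Table 5.7), ~2.785e-2 mW (Table 5.11)
metric = throughput_per_mw(56.401, 2.785e-2)   # on the order of 20 x 10^8
```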
In summary, this section has presented the results of simulation and synthesis runs performed on the various architectures. The functionality of the design was verified by running simulations for 100% code coverage. Synopsys Design Compiler and Synopsys PrimePower were the tools used for estimating the timing, area and power dissipation of these architectures, and these performance metrics were used to determine
Figure 5.12: Power Dissipation for Different Input Vectors for All Architectures
the best architecture for the design. A highly pipelined architecture with a 3-stage
pipelined multiplier was found to be the best design amongst those discussed. The
following chapter gives a summary of the work presented in the thesis along with
future work.
Figure 5.13: Throughput per unit Power Dissipation for All Architectures
Chapter 6
Conclusions and Future Work
6.1 Conclusions
This thesis has presented several architectures for an Arithmetic Logic Unit
(ALU) employing Interval Numbers (I-ALU). The ALU is dedicated for use in ap-
plications related to the DSP and Controls field. Interval arithmetic is one of the
solutions to problems that arise due to rounding of numbers on finite precision
machines. Implementation of Interval Arithmetic in software is a slow process.
Dedicated hardware that performs this arithmetic at speeds comparable to non-
interval ALUs are necessary for this purpose. Fixed-point ALUs have advantages
over floating-point ALUs in terms of cost and complexity of the design. In many
DSP applications, the dynamic range of floating point arithmetic is not required.
Therefore, a faster and lower power fixed point hardware would be a better choice.
The work presented in this thesis gives the basic architecture of an interval
ALU. This ALU operates on 16-bit interval numbers in two's-complement form.
The ALU performs the basic arithmetic operations of addition, subtraction and
multiplication, and it can also perform the set operations of union and
intersection on interval numbers. Since the multiply-accumulate operation forms
an integral part of any DSP system, dedicated hardware is included in this
design to perform it. Division is not inherently performed by the ALU;
division by degenerate intervals that are powers of two can be performed with
the shift operation. The proposed design produces results in one clock cycle,
except when the result is a union of two disjoint intervals or a product of
two intervals that both contain zero. Hardware support is provided to handle
these special cases so that correct results are produced and the overall
design remains uniform from a timing perspective. Errors that arise from
rounding results to a fixed number of bits are accounted for by implementing
the outward rounding algorithm in hardware. The ALU can round its result to
either 24 bits or 16 bits, depending on the application for which it is used.
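The operations above can be sketched in software as follows. This is an illustrative Python model of the arithmetic the I-ALU implements, not the thesis's RTL: plain integers stand in for the 16-bit fixed-point words, and all function names are ours. The shift-based division rounds outward, shifting the lower bound toward −∞ and the upper bound toward +∞.

```python
# Illustrative model of the I-ALU's interval operations (not the thesis RTL).
# An interval is a pair (lo, hi) of integers standing in for 16-bit
# two's-complement fixed-point words.

def iv_add(a, b):
    # [a_lo + b_lo, a_hi + b_hi]
    return (a[0] + b[0], a[1] + b[1])

def iv_sub(a, b):
    # [a_lo - b_hi, a_hi - b_lo]
    return (a[0] - b[1], a[1] - b[0])

def iv_mul(a, b):
    # min/max over the four cross products is always correct, including the
    # special case where both operands contain zero; the hardware narrows
    # this with sign checks in the other cases.
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def iv_union_hull(a, b):
    # hull of the union; a truly disjoint union needs two result intervals
    # and is one of the special cases handled separately in hardware
    return (min(a[0], b[0]), max(a[1], b[1]))

def iv_intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None   # None marks an empty result

def iv_shift_div(a, k):
    # division by 2**k via arithmetic shifts, rounded outward:
    # lower bound shifted toward -inf, upper bound toward +inf
    return (a[0] >> k, -((-a[1]) >> k))
```

For example, `iv_mul((-2, 3), (-1, 4))` returns `(-8, 12)`, the zero-straddling case that requires all four products in hardware.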
With the basic architecture in place, modifications were made to improve the
performance of the design. Since throughput is a critical performance metric,
pipelining is employed to increase efficiency. The multipliers in the basic
design are replaced by pipelined multipliers with different numbers of stages;
after a trade-off between throughput and area, a design with a three-stage
pipelined multiplier is found to be the best. To enhance system performance
further, the rest of the design has also been pipelined. A highly pipelined
design for an interval arithmetic logic unit shows a significant improvement
in throughput over the non-pipelined design. This improvement is achieved by
reducing the critical path of the original design, and the synthesis results
confirm the increase in throughput. Table 6.1 compares the clock period,
frequency of operation, throughput, area and initial latency of the
non-pipelined and highly pipelined designs. An estimate of power dissipation
was also obtained for all of the architectures. From the timing and power
dissipation values, a final performance metric of throughput per unit power
was evaluated to determine the best architecture; the table reflects this
metric as well. From the data in the table, the highly pipelined design is by
far the best architecture among all those discussed.
Performance Metric    Non-pipelined              Highly-pipelined
Clock Period          17.73 ns                   3.25 ns
Frequency             56 MHz                     307 MHz
Throughput            56 × 10⁶ op/cyc            307 × 10⁶ op/cyc
Area                  253476 µm²                 548899 µm²
Initial Latency       2                          7
Throughput/Power      20.231 × 10⁸ op/cyc/mW     71.196 × 10⁸ op/cyc/mW
Table 6.1: Comparison of Non-Pipelined and Highly-Pipelined Designs
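As a back-of-the-envelope check on Table 6.1, the clock-period speedup and the power dissipation implied by the throughput-per-unit-power metric can be recomputed from the tabulated values (in the table's own units; the implied power figures are our derivation, not values stated in the table):

```python
# Recomputing derived quantities from Table 6.1, in the table's own units.
period_np, period_hp = 17.73e-9, 3.25e-9   # clock periods, seconds
thr_np, thr_hp = 56e6, 307e6               # throughput, op/cyc
tpp_np, tpp_hp = 20.231e8, 71.196e8        # throughput per unit power, op/cyc/mW

speedup = period_np / period_hp            # ~5.45x shorter clock period
power_np = thr_np / tpp_np                 # implied power, ~0.028 mW
power_hp = thr_hp / tpp_hp                 # implied power, ~0.043 mW
```

The implied power values of roughly 0.028 mW and 0.043 mW appear consistent with the 0–0.05 mW scale of the power dissipation plot in Figure 5.12.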
6.2 Future Work
The work done in this thesis can be carried forward into several DSP
applications. One possibility is adaptive filtering: the ALU could be employed
to perform the intensive calculations needed in an FIR or IIR filter. The
numerical reliability of such interval-based adaptive systems is expected to
be higher than that of conventional non-interval systems, and comparing the
hardware specifications of non-interval and interval adaptive filters would
quantify the extra effort required to achieve this higher reliability. In
signal processing there are also many instances where the objective function
to be minimized is not convex (e.g., adaptive IIR filtering), so convergence
to the global optimum cannot be guaranteed. Interval analysis has the
potential to handle such non-convex objective functions, and a natural
application of the ALU would be in these global optimization algorithms.
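To illustrate the global optimization idea, here is a minimal interval branch-and-bound sketch in the spirit of Moore and Hansen, not an algorithm from this thesis. Interval evaluation gives a guaranteed lower bound of the objective over each box; boxes whose lower bound exceeds the best known value cannot contain the global minimum and are discarded, and the survivors are bisected. The test function f(t) = (t² − 2)², with global minima at ±√2, is purely illustrative.

```python
# Minimal interval branch-and-bound sketch (Moore/Hansen style), illustrative
# only. f(t) = (t*t - 2)**2 has global minima at t = +/- sqrt(2).

def isqr(a):
    # interval extension of t -> t*t (tight even when the interval spans 0)
    lo, hi = a
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))

def F(x):
    # interval extension of f: a guaranteed enclosure of f over the box x
    s = isqr(x)
    return isqr((s[0] - 2.0, s[1] - 2.0))

def branch_and_bound(lo, hi, tol=1e-6):
    best = float('inf')          # best known upper bound on the minimum
    work, boxes = [(lo, hi)], []
    while work:
        a, b = work.pop()
        if F((a, b))[0] > best:  # box provably cannot contain the minimum
            continue
        m = 0.5 * (a + b)
        best = min(best, (m * m - 2.0) ** 2)  # point value tightens the bound
        if b - a < tol:
            boxes.append((a, b))              # small surviving box
        else:
            work += [(a, m), (m, b)]
    return best, boxes

best, boxes = branch_and_bound(-3.0, 3.0)
# best approaches 0 and the surviving boxes cluster around +/- sqrt(2)
```

The pruning test `F((a, b))[0] > best` is where interval arithmetic earns its keep: it certifies that a whole region of the search space can be discarded, something point arithmetic cannot do for non-convex objectives.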
As part of future work, the vision is to build hardware to evaluate the
trigonometric, logarithmic and exponential functions. This would require
highly complex designs, and increased complexity leads to increased power
dissipation. To limit power dissipation, separate modules could be built to
evaluate these functions based on the frequency of their use, and the overall
design could be made modular: such complex modules could be connected when
required and disconnected at other times to minimize power dissipation.
One of the long-term goals is to build a co-processor based on interval
arithmetic, with this ALU at its heart; a suitable bus architecture and I/O
peripherals would be required. The size of the accumulator could be increased,
based on the application, to allow repeated multiply-accumulate execution and
to handle overflow. The fixed-point number format could also be chosen per
application: Section 3.1.1 describes the representation of 16-bit fixed-point
numbers in the 8:8 form, but the 4:12, 2:14 or 1:15 forms may be chosen
depending on the application for which the ALU is used. Finally, an aim is to
port a major part of this design onto an FPGA.
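The trade-off among the 8:8, 4:12, 2:14 and 1:15 forms is between range and resolution. A hypothetical pair of helpers (names ours, not the thesis's) makes the interpretation of the same 16-bit word under different integer:fraction splits concrete:

```python
# Hypothetical helpers (names ours) interpreting a 16-bit word under an
# m:n integer:fraction split, as in Section 3.1.1's 8:8 example.

def to_fixed(x, frac_bits, word=16):
    # quantize x with step 2**-frac_bits; store as a two's-complement pattern
    v = int(round(x * (1 << frac_bits)))
    if not -(1 << (word - 1)) <= v <= (1 << (word - 1)) - 1:
        raise OverflowError("value does not fit the chosen format")
    return v & ((1 << word) - 1)

def from_fixed(bits, frac_bits, word=16):
    # undo the two's-complement encoding and rescale
    v = bits - (1 << word) if bits & (1 << (word - 1)) else bits
    return v / (1 << frac_bits)

# 8:8 covers roughly [-128, 128) with step 2**-8, while 1:15 covers only
# [-1, 1) but with step 2**-15: range is traded for resolution.
```

Which split suits a given application depends on the expected signal range and required precision, which is why the format is best left as an application-level choice.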