ABSTRACT
GUPTE, RUCHIR. Interval Arithmetic Logic Unit for DSP and Control Applications. (Under the direction of Prof. William W. Edmonson).
There are many applications in the field of digital signal processing (DSP) and controls that require the user to know how various numerical errors (uncertainty) affect the result. Interval Arithmetic (IA) bounds this uncertainty by replacing non-interval values with intervals. Since most DSPs operate in real-time environments, fast processors are needed. The goal is to develop a platform in which interval arithmetic operations are performed at the same computational speed as present-day signal processors.
This thesis proposes a design for an interval-based arithmetic logic unit (I-ALU) whose computational time for implementing interval arithmetic operations is equivalent to many digital signal processors. Many DSP and control applications require a small subset of arithmetic operations that must be computed efficiently. This design has two independent modules operating in parallel to calculate the lower bound and upper bound of the output interval. The functional unit of the ALU performs the basic fixed-point interval arithmetic operations of addition, subtraction, and multiplication, and the interval set operations of union and intersection. In addition, the ALU is optimized to perform dot products through the multiply-accumulate instruction. Traditionally, division is not implemented on digital signal processors unless it can be computed with a shift operation; in this design, division by shifting is implemented.
One of the prime design goals is to maximize the throughput of the ALU for an optimum value of area. Pipelining is implemented to achieve this design goal. Power dissipation analysis of different ALU architectures is performed. Since the design is required to obtain maximum throughput for the least power dissipation, throughput per unit power dissipation is used as the most critical performance metric. This thesis studies several architectures for the ALU and concludes with the one exhibiting the highest performance among those studied.
Interval Arithmetic Logic Unit for DSP and Control Applications
by
Ruchir Gupte
A thesis submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Master of Science
Electrical and Computer Engineering
Raleigh
2006
Approved By:
Dr. Winser E. Alexander Dr. William Rhett Davis
Dr. William W. Edmonson
Chair of Advisory Committee
To
Mom, Dad and Sheetal
Biography
Ruchir Gupte was born on December 9, 1982 in Mumbai, India. He received his Bachelor of Engineering (B.E.) degree in Electronics and Telecommunications from Mumbai University in June 2004. In Fall 2004, he began his graduate
studies in the Electrical and Computer Engineering Department at North Carolina
State University. Since Spring 2005, he has been working in the High Performance
Digital Signal Processing (HiPerDSP) Laboratory of Dr. Winser Alexander and Dr.
William Edmonson in the field of hardware support for interval analysis.
He worked at Sony Ericsson Mobile Communications Inc., Raleigh, as an intern from May 2005 to August 2005. He has also taken a keen interest in community participation and has been a committee member of MAITRI, the NC State Indian Graduate Student Association. Moreover, he has volunteered for various events organized by the Department of Electrical and Computer Engineering at North Carolina State University.
Acknowledgements
Above all, I thank my parents for the much needed motivation throughout the
duration of my stay away from home. It was their love and support that helped
me maintain sanity during stressful times. My sister, Sheetal, has been a great
inspiration for me throughout.
I sincerely acknowledge the efforts of Dr. William Edmonson, my academic advi-
sor, in providing guidance and encouragement for the successful completion of this
thesis. Dr. Edmonson has made available all resources that I could possibly need
and also allowed the independence of applying my ideas in this project. I am deeply
indebted to him for his patience and invaluable suggestions during the course of this
project.
I am also grateful to the other members of my thesis committee, Dr. Winser
Alexander and Dr. Rhett Davis for devoting their time and providing useful inputs.
Completion of this work would not have been possible without their guidance.
I sincerely wish to express my gratitude to the HiPer DSP Research group for
creating an environment that has been fabulous for research and fun. Additional
thanks to Ramsey Hourani and Senanu Ocloo for their unconditional help through-
out my stay in the group. The encouragement and moral support extended by all
members of the group through good and hard times cannot be described in words.
Special thanks to Ravi Jenkal for his inputs and help.
Finally, I would like to thank those near and dear to me, without whose backing this thesis would have remained a distant reality. I am grateful to all my friends in Raleigh for being there, with a special mention to my roommate, Karan Tewari, for his steady support and friendship.
Contents
List of Tables . . . vii
List of Figures . . . viii

1 Introduction . . . 1
1.1 Motivation . . . 4
1.2 Background . . . 5
1.3 Contribution . . . 7
1.4 Thesis Organization . . . 8

2 Interval Arithmetic . . . 9
2.1 Interval Arithmetic and Set Operations . . . 10

3 Design Specifications of the ALU . . . 16
3.1 Fixed-point two’s complement arithmetic . . . 16
3.1.1 Representation of numbers . . . 17
3.1.2 Arithmetic Operations . . . 18
3.2 Outward/Directed Rounding . . . 21

4 Hardware Architecture . . . 25
4.1 Overall Architecture . . . 25
4.2 Flag Generator Module . . . 27
4.3 Lower Bound and Upper Bound Modules . . . 28
4.3.1 Functional Units and Control Logic . . . 31
4.3.2 Special Case Multiplication Block . . . 32
4.3.3 Multiply-Accumulate Block . . . 34
4.4 Rounding Unit . . . 35
4.5 Pipeline Architecture of the Design . . . 38
4.5.1 Need for Pipelining . . . 38
4.5.2 Partially Pipelined Design . . . 39
4.5.3 Highly Pipelined Design . . . 41

5 Testing and Results . . . 44
5.1 Simulation Results . . . 45
5.2 Synthesis Results . . . 45
5.2.1 Non-Pipelined Architecture . . . 47
5.2.2 Design with Pipeline Multipliers . . . 49
5.2.3 Highly Pipelined Design . . . 54
5.3 Power Analysis . . . 57
5.3.1 Generating Input Vectors . . . 59
5.3.2 Statistical Results from Power Scripts . . . 60

6 Conclusions and Future Work . . . 66
6.1 Conclusions . . . 66
6.2 Future Work . . . 68

Bibliography . . . 70
List of Tables

2.1 Nine Cases in Multiplication . . . 12
3.1 Two’s complement fixed-point representation . . . 18
4.1 Description of ALU Inputs . . . 27
4.2 Description of ALU Outputs . . . 27
4.3 mul flag for the Multiplication operation . . . 28
4.4 Flag Generation . . . 29
5.1 Timing Reports for the Non-Pipelined Architecture . . . 47
5.2 Area Reports of Non-Pipelined Architecture . . . 49
5.3 Timing Reports for the Pipelined Multipliers . . . 51
5.4 Timing Reports for various Pipelined Architectures . . . 51
5.5 Area Reports for various Pipelined Architectures . . . 52
5.6 Timing Reports for Non-Pipelined and Highly-Pipelined Architectures . . . 56
5.7 Results for Non-Pipelined and Highly-Pipelined Architectures . . . 56
5.8 Area Reports for Non-Pipelined and Highly-Pipelined Architectures . . . 56
5.9 Power Dissipation for Different Architectures with 500 Input Vectors . . . 60
5.10 Power Dissipation for 3-stage Pipelined Architecture . . . 62
5.11 Power Dissipation for All Stages with different Input Vectors . . . 63
5.12 Throughput per unit Power Dissipation for All Architectures . . . 63
6.1 Comparison of Non-Pipelined and Highly-Pipelined Designs . . . 68

List of Figures

3.1 Two’s complement number representation . . . 17
4.1 Top Level Block Diagram of the ALU . . . 26
4.2 Flag Generation Module . . . 29
4.3 Block Diagram of Lower Bound Module . . . 30
4.4 Block Diagram of Upper Bound Module . . . 30
4.5 Lower Bound Module . . . 32
4.6 Upper Bound Module . . . 33
4.7 Special Case Multiplication . . . 34
4.8 Multiply-Accumulate Module . . . 35
4.9 Lower Bound Rounding . . . 36
4.10 Upper Bound Rounding . . . 37
4.11 Critical Path . . . 39
4.12 Non-Pipelined Multiplier Architecture . . . 40
4.13 Two-stage Pipelined Multiplier Architecture . . . 40
4.14 Three-stage Pipelined Multiplier Architecture . . . 40
4.15 Four-stage Pipelined Multiplier Architecture . . . 41
4.16 Five-stage Pipelined Multiplier Architecture . . . 41
4.17 Highly Pipelined Architecture . . . 42
5.1 Simulation Results for Add, Subtract and Multiply . . . 46
5.2 Simulation Results for Interval Union and Intersection . . . 46
5.3 Timing Reports of Non-Pipelined Architecture . . . 48
5.4 Area Reports for the Non-Pipelined Architecture . . . 50
5.5 Timing Reports of Different Pipelined Architectures . . . 53
5.6 Area Reports of Different Pipelined Architectures . . . 54
5.7 Timing and Area Reports of Different Pipelined Architectures . . . 55
5.8 Timing Reports of Non-Pipelined and Highly-Pipelined Architectures . . . 57
5.9 Area Reports of Non-Pipelined and Highly-Pipelined Architectures . . . 58
5.10 Power Dissipation for 500 Input Vectors . . . 61
5.11 Power Dissipation for different Input Vectors for 3-stage Pipelined Multiplier Design . . . 62
5.12 Power Dissipation for different Input Vectors for All Architectures . . . 64
5.13 Throughput per unit Power Dissipation for All Architectures . . . 65
Chapter 1
Introduction
Interval Arithmetic (IA) performs computations on intervals of real numbers
instead of real numbers themselves. It takes into account the numerical errors that
occur due to performing arithmetic on a computer. This is a problem that occurs
on all computers that make use of binary number systems, such as the IEEE 754
Standard for Binary Floating-Point Number Systems [1]. As of now, implementa-
tion of interval arithmetic is performed in software. The GNU Fortran Compiler
has been modified to provide support for an interval data type [2], based on the
Interval Arithmetic Specification [3]. The SUN Studio Fortran95 compiler provides
support for interval operations [4]. The SUN studio has compilers and tools that
support C and C++ development as well [5]. The main disadvantage of software implementations is their slow speed: they incur tremendous overhead from changing rounding modes, function calls, exception handling, memory management, and so on. For instance, the operation of multiplication requires several conditional branches to determine which interval end-points need to be multiplied.
Based on the values of the input intervals relative to zero, nine different cases of
multiplication have to be accounted for to select the end-points. A large number
of conditional statements are required to select between these multiplication cases.
The extra work due to individual operations being done sequentially requires several
time consuming steps. The performance penalty to be paid for misprediction of con-
ditional branches is quite heavy in the case of fully pipelined processors. Changing
of rounding modes in software also requires a large number of computational cycles.
On many processors, changing the rounding mode causes the entire floating-point
pipeline to be flushed, which results in a delay of several cycles and severely limits
parallel execution. Furthermore, software implementations of interval multiplica-
tion are typically implemented as subroutines, which adds overhead for subroutine
calls and returns.
Thus, interval algorithms end up running slower on current computer architec-
tures compared to their real arithmetic counterparts [6]. Software implementations
are found to be as much as four times slower than functionally equivalent hardware
[7]. Hardware support is required to overcome these performance drops caused by
the above software issues.
Applications of digital signal processing (DSP) involve a very large number of
arithmetic operations, and the necessity of obtaining accurate results makes it im-
perative to perform reliable numerical computations. The goal of this design is to
solve problems in the DSP field with higher accuracy and at a faster rate. Since
software implementations are slow, it is necessary to build dedicated hardware in or-
der to achieve this goal. Interval methods form one of the solutions to reduce errors
resulting from numerical computations. This was the motivation behind building
an Interval ALU dedicated to DSP and Control applications.
Interval-based algorithms continue to find applications as solutions to signal processing and control problems. For instance, in signal processing there is usually the need to determine the optimal solution to a problem, i.e., to minimize a cost function. The ability of interval global optimization approaches to guarantee convergence to the global minimum point(s) is what makes such approaches attractive in DSP and control applications. Optimized hardware for global optimization could reduce some of the overhead associated with the complete software solution
mentioned earlier. DSP and control algorithms need to be designed in such a way
that roundoff and truncation errors that occur naturally due to the discrete nature
of computing do not prevent the algorithm from converging to the global minimum.
Interval analysis provides a means of managing such errors. It is therefore possible
to obtain numerically accurate and reliable results. Reliable results may be defined as solutions in which the obtained value is guaranteed to bound the exact result of the operation being performed. Interval analysis gives as its output an interval that is certain to contain the exact value of the result expected from an operation performed on two input interval numbers. It is thus capable of providing reliable results.
These results can be achieved by using arithmetic logic units (ALU) that are
specially designed to manipulate interval numbers. Such an Interval ALU (I-ALU)
can be used as the core of a digital signal processor. In contrast to general pur-
pose microprocessors that are designed to handle general computing tasks, digital
signal processors are designed and optimized to operate on algorithms that are
characterized by repetitive multiply-and-add operations. In general, they feature
fast multiply-accumulate instructions, multiple-access memory, specialized program
control for interrupt handling and I/O, and fast and efficient access to peripherals.
We desire to achieve maximum efficiency while providing these features that make
up a good digital signal processor.
Throughput is the most important metric to analyze the performance of a DSP
system. The throughput problem will have to be solved for interval algorithms to
become more practical. The throughput of an I-ALU will have to be comparable to
that of non-interval units. Pipelining provides an effective solution to improving the
throughput of the design. By definition, pipelining is an implementation technique
where multiple instructions are overlapped in execution. Each stage completes a
part of an instruction in parallel. Pipelining does not decrease the time for individual
instruction execution. It increases instruction throughput, instead. Throughput of
a system is directly related to the depth of the pipeline. However, increasing the
depth of the pipeline adversely affects the area of the design. Hence an optimum
design would involve a proper trade-off between the throughput and area, where
the throughput would have more importance in signal processing applications.
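The throughput-versus-depth trade-off described above can be illustrated with a first-order model. The following Python sketch uses assumed numbers (a 10 ns combinational path and 0.5 ns of register overhead per stage, both hypothetical, not figures from this design) to show why deeper pipelines raise throughput while register overhead gives diminishing returns:

```python
# First-order pipeline model; the delay figures are illustrative assumptions,
# not timing results from this thesis.
def pipelined_throughput(total_delay_ns, stages, reg_overhead_ns=0.5):
    """Throughput in operations/ns for an ideally balanced pipeline."""
    cycle_ns = total_delay_ns / stages + reg_overhead_ns  # stage delay + register
    return 1.0 / cycle_ns

# Throughput rises with pipeline depth, with diminishing returns.
for depth in (1, 2, 3, 4, 5):
    print(depth, round(pipelined_throughput(10.0, depth), 3))
```

The model ignores stage imbalance and hazards, but it captures the basic relationship: throughput grows with depth while the cycle time floors out at the register overhead.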
1.1 Motivation
Digital signal processing has become the choice for many applications related to communications, control, multimedia, etc., because of the high performance it achieves for applications that involve a limited instruction set implementing repetitive linear operations such as addition, multiplication, and delay on a stream of sampled data. Often a DSP has been used as an attached coprocessor or combined with one or more FPGA devices to meet the performance and cost requirements of a particular application. Common to DSPs is the ability to perform multiply-accumulate (MAC) operations in a single instruction cycle. This operation is key to computing vector products, which in turn are central to computing Fourier transforms and correlations. Other features of a DSP include the ability to access multiple memories, dedicated address units for simultaneous access to data memory and program memory, and modulo addressing. Several DSP manufacturers also include specialized peripherals along with fast interrupt handling. Examples of these specialized peripherals include analog-to-digital converters and I/O for multiprocessor communications.
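In software terms, the MAC-based vector product reduces to the following sketch (illustrative Python, not the hardware implementation):

```python
def dot_product(x, y):
    """Vector product as repeated multiply-accumulate: acc += x[i] * y[i]."""
    acc = 0
    for xi, yi in zip(x, y):
        acc += xi * yi  # one MAC; a DSP issues this in a single instruction cycle
    return acc

# A 4-tap correlation step.
print(dot_product([1, 2, 3, 4], [4, 3, 2, 1]))  # 20
```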
Underlying many of these applications is the need for accurate and reliable results, but errors due to rounding, uncertainty of the data, quantization noise, and catastrophic cancellation in floating-point computations can lead to inaccuracies.
Sometimes these inaccuracies can go unnoticed. For many applications in signal
processing, operations are recursive and act on a sequence of data. This implies
that numerical errors can grow unbounded over time. An efficient method for mon-
itoring and controlling these inaccuracies is to replace point arithmetic with interval
arithmetic. Interval methods are capable of bounding these inaccuracies.
Digital signal processors represent one of the fastest growing segments of the
embedded world. Despite their vast use, DSPs present difficult challenges for pro-
grammers. Since computation speed is of critical importance to DSP applications,
DSPs focus on supporting fixed-point operations.
The use of fixed-point representation not only requires programmers to deal with mathematically sophisticated scaling techniques, but also forces them to deal with errors introduced by the use of reduced-precision arithmetic. Although it would be ideal to use floating-point arithmetic over fixed point, as a practical consideration, fixed-point processors operate at a much faster rate than their floating-point counterparts. Fixed-point DSPs execute in the gigahertz range; floating-point DSPs peak out in the 300-400 megahertz range. Fixed-point DSPs enjoy the additional advantage of being consumed in large volumes, as a result of which their price per chip is a fraction of that of a floating-point DSP [8]. Fixed-point processors gain speed and power efficiency over floating-point processors at the cost of reduced precision. However, DSP applications rarely require the full dynamic range offered by the floating-point number system. This justifies the choice of fixed-point arithmetic over floating-point arithmetic for our ALU design.
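As an illustration of the reduced precision of fixed-point arithmetic, the following Python sketch models a two's complement Q15 format; the 16-bit word length is an assumption for illustration only, and the actual representation used by the ALU is specified in Chapter 3:

```python
FRAC_BITS = 15              # assumed Q15 format: 1 sign bit, 15 fraction bits
SCALE = 1 << FRAC_BITS      # 32768

def to_fixed(x):
    """Quantize a real value to a 16-bit two's complement Q15 integer."""
    n = int(round(x * SCALE))
    return max(-SCALE, min(SCALE - 1, n))   # saturate to [-1, 1 - 2**-15]

def to_float(n):
    """Recover the real value represented by a Q15 integer."""
    return n / SCALE

x = 0.3
q = to_fixed(x)
# The quantization error is bounded by one least-significant bit.
print(q, to_float(q), abs(x - to_float(q)) <= 1.0 / SCALE)
```

The bounded quantization error shown here is exactly the kind of uncertainty that interval arithmetic is designed to track explicitly.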
As mentioned earlier, hardware solutions are needed in place of software implementations to solve the speed problem. This brings up the idea of building an arithmetic logic unit dedicated to performing arithmetic operations on interval inputs in fixed-point representation. The aim is to develop hardware that is optimized for interval operations pertinent to signal processing applications, such as addition, subtraction, multiplication, and multiply-accumulate.
1.2 Background
Interval algorithms have found their usage in applications such as global opti-
mization [9], [10], [11], function evaluation [12], finding roots of polynomials [13],
solving differential equations [14], solving non-linear equations [15] et al. In most of
these cases, interval arithmetic is used to solve problems which cannot be efficiently
solved using conventional floating-point arithmetic.
Several software tools have been developed to support interval arithmetic. These include interval arithmetic libraries in Fortran [16], [17], [18] and C++ [19], extended scientific programming languages such as PASCAL [20], C++ [21], and Fortran [22], [23], and interval-enhanced compilers [24]. Despite these developments in the field, interval arithmetic has not gained popularity owing to its speed issues when compared to conventional floating-point methods. It is believed that the performance of interval arithmetic needs to be within a factor of five of floating-point arithmetic for it to gain general acceptance [25]. Hardware support for interval arithmetic is required to achieve this high performance. Several interval-based hardware designs have been implemented, a few of which are listed below:
• Hardware Interval Multipliers [26]
The author presents serial and parallel hardware units for interval multiplica-
tion. These units provide automatic interval end-point selection and correct
rounding of results. While the serial interval multiplier uses a single multiplier
unit, the parallel multiplier uses dual multipliers to compute the interval end-
points simultaneously. These multipliers provide a significant performance
boost for acceptable increases in area.
• A Combined Interval and Floating Point Multiplier [27]
This design is based on the approach that an interval multiplier can share hardware with an existing floating-point multiplier, thereby achieving the performance benefits of an interval multiplier at relatively low cost. The design resorts to software solutions for the uncommon case of multiplication where both end-points contain zero. Interval multiplication requires only one more cycle than floating-point multiplication, and is one to two orders of magnitude faster than software implementations of interval multiplication.
• A Combined Interval and Floating Point Divider [28]
This design follows a similar approach as above, where an existing floating
point divider is modified to enable interval division on it. Based on the values
of interval inputs relative to zero, seven different cases of division are ad-
dressed. Interval division can be performed after modifying the floating point
divider with a 24% increase in area.
• A Combined Interval and Floating Point Comparator [29]
This design is an implementation of a combined interval and floating-point
comparator, which performs interval intersection, hull, minimum, maximum
and comparisons, as well as floating-point minimum, maximum and compar-
isons. It has around 98% more area than conventional floating-point compara-
tors and a worst case delay that is 42% greater.
• Variable Precision Interval Arithmetic Processors [30]
The author presents designs, arithmetic algorithms and software support for
a family of variable precision, interval arithmetic processors. These processors
give the programmer the ability to detect, and if desired, to correct the implicit
errors in finite precision numerical computations. The processors are two to
three orders of magnitude faster than software packages that provide similar
functionality.
However, all of the above architectures are designed for floating-point represen-
tation of numbers. Although these are high-precision computational units, they
have lower throughput than their potential fixed-point counterparts. As mentioned
earlier, they also have higher design complexity and hence are undesirable for DSP
applications.
1.3 Contribution
This thesis presents the hardware architecture of an I-ALU and optimizes it to achieve maximum efficiency. The whole architecture is based on fixed-point
two’s complement representation of numbers. Two’s complement representation is the most convenient for performing arithmetic because of its uniformity over positive and negative numbers during operations and rounding. Although fixed-point arithmetic reduces the precision of results, the precision it provides is sufficient for DSP-related applications. Besides, it has the added advantages of reduced design complexity and higher throughput. A basic hardware model has been built at the RTL level of abstraction, and the design has been modified for better efficiency by the use of pipelined multipliers of increasing depths. These designs have been explored, and statistical data based on the results of simulation and synthesis has been used to determine the optimal solution. Throughput, area, power dissipation, and numerical reliability are the performance metrics used for system evaluation.
1.4 Thesis Organization
Chapter 2 introduces the reader to the concept and conventional representation of interval numbers. The various arithmetic and set operations that the I-ALU can perform on these numbers are discussed in detail in this chapter. Chapter 3 provides the design specifications of the proposed ALU; the significance of using two’s complement representation of numbers can be seen here, along with the details of the rounding issue. Chapter 4 describes in detail the hardware architecture of the ALU. A comprehensive description of each module constituting the ALU is given here, and the issue of rounding is addressed. Chapter 5 provides the results of simulation runs and synthesis performed on different architectures of the design. An exhaustive comparison of the results from the non-pipelined design and various versions of the pipelined design is made to arrive at an optimal solution, with throughput, area, and power dissipation used as the performance metrics. Finally, Chapter 6 concludes the work and details the future work that can be done on the design to broaden its scope and improve its functionality and efficiency.
Chapter 2
Interval Arithmetic
In the words of Ramon E. Moore [31], “If we have, in addition to the results of
a computation, error bounds for the differences between the results and the exact
solution values, then no matter how these error values were obtained, by analytical
means or by further machine computations during or after the given computation,
it will always be the case that we have, in effect, for each exact result sought, a pair
of numbers: an approximate value and an error bound, or an upper and a lower
bound to the exact result.”
Real numbers can have infinite precision, whereas all machines are inherently finite-precision. Owing to this, real numbers are approximated to bring them to machine-representable forms, and the resulting error bound may be considered an uncertainty. Interval analysis is a means of representing this uncertainty by replacing single point values with intervals. It provides a means of bounding the errors that accrue due to the discrete nature of computing.
An interval number is defined to be an ordered pair of real numbers [a, b], such
that a ≤ b. Using the notation {x |P(x)} for “the set of x such that the proposition
P(x) holds,” we can write
[a, b] = {x | a ≤ x ≤ b}, where x ∈ ℝ
Using this convention, real numbers are represented as intervals with identical upper and lower bounds. Such intervals are called “degenerate intervals” and have the form [a, a]. The usual operations of addition, subtraction, multiplication, and division that are possible with real numbers are also defined for interval numbers [32].
An interval number is also a set of real numbers. The interval number [a, b]
is a set of real numbers x such that a ≤ x ≤ b. Hence, set operations of union
and intersection can also be done on interval numbers. Section 2.1 discusses in
depth, the various arithmetic and set operations performed by the proposed ALU.
In interval arithmetic, the true result is guaranteed to lie within the resulting interval. This is achieved by the outward rounding algorithm. Outward rounding on an interval X = [a, b] rounds the lower bound, a, down to the largest machine-representable number not greater than a, and the upper bound, b, up to the smallest machine-representable number not less than b. This involves the use of the round-down and round-up modes on the lower and upper bounds, respectively. Directed rounding capabilities, that is, the ability to round down or round up, have been available since the Intel 8087 chip. As a result, interval arithmetic is possible on virtually any computer.
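For a fixed-point machine, the round-down and round-up modes amount to flooring and ceiling at the machine's resolution. A minimal Python sketch follows; the 2^-15 resolution is an illustrative assumption, not the ALU's actual word length:

```python
import math

LSB = 2.0 ** -15   # assumed machine resolution: one fixed-point LSB

def round_outward(a, b):
    """Round [a, b] outward: lower bound down, upper bound up."""
    lo = math.floor(a / LSB) * LSB   # largest representable value not greater than a
    hi = math.ceil(b / LSB) * LSB    # smallest representable value not less than b
    return lo, hi

lo, hi = round_outward(0.1, 0.2)
print(lo <= 0.1 <= 0.2 <= hi)  # True: the exact interval is always enclosed
```

Note that outward rounding can only widen an interval, never narrow it, which is what preserves the containment guarantee.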
2.1 Interval Arithmetic and Set Operations
As described in the previous section, interval numbers are represented by an
ordered pair [a, b] such that a ≤ b. The arithmetic interval operations of addition,
subtraction and multiplication will be discussed in this section. The rules for ef-
ficiently implementing these operations are listed here. In addition, the rules for
the set operations of union and intersection, along with the calculations of the width and mid-point of a single interval, are also described.
In the following discussion, we consider two input interval numbers. They are
represented as [xL, xU ] and [yL, yU ], where xL and yL are the lower bounds and xU
and yU are the upper bounds of the two intervals. Except for one special case of set
union of two disjoint sets, all operations result in a single output interval number.
The outputs of various interval operations are obtained as follows:
• Addition
Addition of interval numbers is a straightforward operation where the lower
bound of the output interval is obtained from the sum of the lower bounds of
the input intervals, while the upper bound of the output interval is obtained
from the sum of the upper bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU ] + [yL, yU ] = [xL + yL, xU + yU ]
• Subtraction
In subtraction, the lower bound of the output interval is obtained by subtracting the upper bound of the second interval number from the lower bound of the first interval number. Similarly, the upper bound of the output interval is obtained by subtracting the lower bound of the second interval number from the upper bound of the first interval number.
Mathematically, this can be represented as:
[xL, xU ] - [yL, yU ] = [xL − yU , xU − yL]
Table 2.1: Nine Cases in Multiplication
Case Condition Result
1 xL ≥ 0; yL ≥ 0 [xLyL,xUyU ]
2 xL ≥ 0; yL < 0 < yU [xUyL,xUyU ]
3 xL ≥ 0; yU ≤ 0 [xUyL,xLyU ]
4 xL < 0 < xU ; yL ≥ 0 [xLyU ,xUyU ]
5 xL < 0 < xU ; yU ≤ 0 [xUyL,xLyL]
6 xU ≤ 0; yL ≥ 0 [xLyU ,xUyL]
7 xU ≤ 0; yL < 0 < yU [xLyU ,xLyL]
8 xU ≤ 0; yU ≤ 0 [xUyU ,xLyL]
9 xL < 0 < xU ;yL < 0 < yU [min(xUyL,xLyU),
max(xLyL,xUyU)]
• Multiplication
Multiplication presents a more difficult problem than addition and subtraction.
Unlike those two operations, the signs of the operands, and not just their
magnitudes, must be taken into consideration: together they determine which two
values are multiplied to obtain the lower and upper bounds. In the general case,
the result of multiplying two input intervals would be obtained as follows:
If [xL, xU ] ∗ [yL, yU ] = [zL, zU ], then,
zL = min(xLyL, xLyU , xUyL, xUyU) and
zU = max(xLyL, xLyU , xUyL, xUyU)
These computations require eight multiplications and several comparisons
to be performed before the lower and upper bounds of the intervals can be
obtained. This makes the multiplication operation highly inefficient. To overcome
this problem, the multiplication operation is split into nine cases based on the
signs of the operands with respect to zero. Table 2.1 lists these nine cases and
gives the output interval for each of them.
From this table, it can be observed that obtaining the lower bound and the upper
bound of the output interval now requires only two multiplications, compared to
the eight multiplications of the brute force method. Comparisons of the input
values must be performed first to determine which case applies. This reduction
in the number of multiplications makes the design more hardware efficient.
A special mention needs to be made of case 9, where both the input intervals
include zero in them. Although this would be a rare case in high resolution
processors, it needs to be addressed for the purpose of numerical reliability. As
can be seen from the table, the output for this case requires 4 multiplications
and 2 comparisons to be performed. This case leads to increased complexity
in the design and from the hardware point of view requires double the amount
of computational time as compared to other operations.
• Union of Interval Numbers
Union of interval numbers is done in the same way as the union operation
in set theory. By definition, for two sets A and B, (A ∪ B) is defined as a
set containing all elements of set A and all elements of set B. Similarly for
interval numbers, to perform the union operation, the lower bound is obtained
by determining the minimum value of the lower bounds of the two input
intervals while the upper bound is obtained by determining the maximum
value of the upper bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU ] ∪ [yL, yU ] = [min(xL, yL), max(xU , yU)]
For interval numbers, the union of two disjoint sets has to be dealt with
separately. This case results in two output intervals instead of one, the output
intervals being exactly equal to the two input intervals. Amongst all operations
performed by the ALU, this is the only operation which results in two output
intervals.
Mathematically, this can be represented as:
For two disjoint intervals, [xL, xU ] and [yL, yU ],
[xL, xU ] ∪ [yL, yU ] = {[xL, xU ], [yL, yU ]}
• Intersection of Interval Numbers
Intersection of interval numbers is done in the same way as the intersection
operation in set theory. By definition, for two sets A and B, (A ∩ B) is
defined as a set containing only those elements that belong to set A and to set
B. For the intersection operation, the lower bound is obtained by determining
the maximum value of the lower bounds of the two input intervals while the
upper bound is obtained by determining the minimum value of the upper
bounds of the input intervals.
Mathematically, this can be represented as:
[xL, xU ] ∩ [yL, yU ] = [max(xL, yL), min(xU , yU)]
A null set is obtained for the intersection of two disjoint sets.
• Width
The “width” operation is performed on a single interval. Width of an interval
is defined as the difference between the upper bound and lower bound of the
interval. The output is naturally a single value.
Mathematically, this can be represented as:
width[xL, xU ] = xU − xL
• Mid-point
The “mid-point” operation is also performed on a single interval. Mid-point
of an interval is obtained by taking the average of the lower bound and upper
bound of the input interval. Once again, the output is a single value.
Mathematically, this can be represented as:
midpoint [xL, xU ] = (xU + xL)/2
Division by 2 is performed by right shifting the sum of the two bounds of the
interval by one bit. Because the sum can overflow into one extra bit, the operands
are first sign extended by one bit so that the shift preserves the sign.
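The interval operations above can be summarized in a short software model. This is an illustrative Python sketch, not the thesis hardware: intervals are (lo, hi) tuples, rounding is ignored, and for the eight non-special multiplication cases a brute-force min/max over the four products is used, which yields the same interval that the hardware obtains with the two products selected per Table 2.1.

```python
# Behavioral model of the interval operations; names (i_add, i_mul, ...)
# are illustrative, not taken from the thesis design.

def i_add(x, y):
    """[xL,xU] + [yL,yU] = [xL+yL, xU+yU]"""
    return (x[0] + y[0], x[1] + y[1])

def i_sub(x, y):
    """[xL,xU] - [yL,yU] = [xL-yU, xU-yL]"""
    return (x[0] - y[1], x[1] - y[0])

def i_mul(x, y):
    """Interval multiplication; case 9 of Table 2.1 handled explicitly."""
    xL, xU = x
    yL, yU = y
    if xL < 0 < xU and yL < 0 < yU:              # case 9: both straddle zero
        return (min(xU * yL, xL * yU), max(xL * yL, xU * yU))
    p = (xL * yL, xL * yU, xU * yL, xU * yU)     # other 8 cases: brute force
    return (min(p), max(p))                      # matches the table's 2 products

def i_union(x, y):
    """Union; disjoint inputs yield both intervals unchanged."""
    if x[1] < y[0] or y[1] < x[0]:               # disjoint: two output intervals
        return (x, y)
    return (min(x[0], y[0]), max(x[1], y[1]))

def i_intersect(x, y):
    """Intersection; None models the empty-set (disjoint) flag."""
    lo, hi = max(x[0], y[0]), min(x[1], y[1])
    return (lo, hi) if lo <= hi else None

def width(x):
    return x[1] - x[0]

def midpoint(x):
    return (x[0] + x[1]) / 2                     # hardware: shift right by 1
```

For example, i_mul((-1, 2), (-3, 4)) exercises the special case and returns (-6, 8).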
The operations described above will be implemented on the proposed I-ALU.
Unique to this work will be the fact that these operations in conjunction with
directed rounding will be based on fixed point arithmetic.
Chapter 3
Design Specifications of the ALU
The ALU is based on a parallel architecture in which the lower bound and the
upper bound of the output interval are computed simultaneously. The
design is built for fixed point operation using the two’s complement representation
of numbers. Fixed-point two’s complement interval arithmetic and rounding are
described in detail in this section.
3.1 Fixed-point two’s complement arithmetic
The main focus of this design is to build a fixed-point interval arithmetic and
logic unit, in contrast to the floating-point interval units that have been designed
previously and discussed briefly in section 1.2. To this end, it is important to be
familiar with the operations performed on fixed-point numbers. This section is an
in-depth study of fixed-point arithmetic. It explains the functionality of the three
basic operations, viz. addition, subtraction and multiplication, performed on
fixed-point numbers. Given that our design is oriented towards DSP related
applications, division is not performed in hardware; division by powers of 2 is
performed with the shift operation. With a proper understanding of fixed-point
arithmetic, it is easy to model our hardware for the application in which the ALU
is going to be used. Whether the ALU operates on real numbers or on intervals,
the logic behind the arithmetic being performed remains the same.
As the most general case, the two's complement format is used to represent the
fixed-point numbers, since it accounts for operations performed on both positive
and negative numbers.
3.1.1 Representation of numbers
In the binary number system, an N-bit word represents integer values from 0
to 2^N − 1. This is referred to as the unsigned integer representation. The
fixed-point representation has a predefined position for the radix point, which
means that a fixed number of bits is reserved for the integer part and a fixed
number for the fractional part. A 32-bit number having 16 bits reserved for the
integer part and 16 for the fractional part is written as 16:16. However, this mode
lacks the ability to represent negative numbers.
The two's complement method of representing fixed-point numbers accounts for
both positive and negative numbers. The MSB of the fixed-point number indicates
the sign (referred to as the sign bit), whereas the rest of the bits define the
magnitude of the number. Figure 3.1 shows the structure of an N-bit signed
number in two's complement format as used in this implementation. The range of
numbers represented by an N-bit word is from −2^(N−1) to 2^(N−1) − 1. Two's
complement representation of numbers greatly simplifies the hardware
implementation of the arithmetic being performed.

Figure 3.1: Two's complement number representation (a sign bit followed by the
N − 1 magnitude/fraction bits)

Table 3.1 provides a few examples of 16 bit two's complement fixed-point
numbers in the 8:8 format.

Table 3.1: Two's complement fixed-point representation
Binary Two's Complement Decimal Equivalent
00010110.11000000 22.75
11101001.01000000 -22.75
00001000.00100000 8.125
11110111.11100000 -8.125
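As a concrete illustration of Table 3.1, the 8:8 encoding can be sketched in a few lines of Python. The helper names to_q88/from_q88 are assumptions for this example, not names from the thesis.

```python
# Encode/decode real numbers as 16-bit two's complement 8:8 words,
# matching the examples of Table 3.1.

def to_q88(value):
    """Encode a real number as a 16-bit two's complement 8:8 word."""
    raw = round(value * 256)               # scale by 2^8 (8 fractional bits)
    assert -2**15 <= raw < 2**15, "out of 8:8 range"
    return raw & 0xFFFF                    # two's complement wrap to 16 bits

def from_q88(word):
    """Decode a 16-bit 8:8 word back to a real number."""
    if word & 0x8000:                      # sign bit set: negative number
        word -= 0x10000
    return word / 256
```

For example, to_q88(22.75) gives the bit pattern 00010110.11000000 (0x16C0) from the first row of Table 3.1.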
3.1.2 Arithmetic Operations
This section provides examples of various operations performed on two's
complement fixed point numbers. Three basic operations of addition, subtraction and
multiplication are considered. Let us go through each of these operations one at a
time.
• Addition
Addition involves simple addition of bits when the number is represented in
two’s complement form. The two operands are sign extended from 16 bits to
17 bits and the 17th bit of the result of addition is then sign extended to obtain
the 32 bit output. The following examples illustrate the addition operation:
1. 22.75 + (-8.125) = 14.625
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are appended to the right to extend the fractional
part. Hence 00001110.10100000 in the 8:8 format is represented
as 0000000000001110.1010000000000000 in the 16:16 format.
2. (-8.125) + (-8.125) = (-16.25)
1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
1 1 1 1 0 1 1 1 1 . 1 1 0 0 0 0 0 0
The 17th bit, 1, is used for sign extension and 7 ones are added to
the left. 8 zeros are appended to the right to extend the fractional
part. Hence 11101111.11000000 in the 8:8 format is represented as
1111111111101111.1100000000000000 in the 16:16 format.
Thus, in terms of hardware, the 16 bit number needs to be sign extended to
17 bits and the 17th bit of the result needs to be used for sign extension.
• Subtraction
The rules followed in addition also apply when performing subtraction of
numbers in two's complement form. The only change is that we take the two's
complement of the number to be subtracted and then add it to the other
number, which remains in its two's complement form. The rest of the procedure
is unchanged. The following examples illustrate the subtraction operation:
1. 22.75 - 8.125 = 14.625
22.75 in two’s complement form is represented as 00010110.11000000.
8.125 in two’s complement form is represented as 00001000.00100000.
Its two’s complement is 11110111.11100000. Hence,
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 1 1 1 1 1 0 1 1 1 . 1 1 1 0 0 0 0 0
0 0 0 0 0 1 1 1 0 . 1 0 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are appended to the right to extend the fractional
part. This gives the desired result of the subtraction, 14.625.
2. 8.125 - 22.75 = (-14.625)
8.125 in two’s complement form is represented as 00001000.00100000.
22.75 in two’s complement form is represented as 00010110.11000000.
Its two’s complement is 11101001.01000000. Hence,
0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0
+ 1 1 1 1 0 1 0 0 1 . 0 1 0 0 0 0 0 0
1 1 1 1 1 0 0 0 1 . 0 1 1 0 0 0 0 0
The 17th bit, 1, is used for sign extension and 7 ones are added to
the left. 8 zeros are appended to the right to extend the fractional
part. This gives the desired result of the subtraction, -14.625.
3. 22.75 - (-8.125) = 30.625
22.75 in two’s complement form is represented as 00010110.11000000.
-8.125 in two’s complement form is represented as 11110111.11100000.
Its two’s complement is 00001000.00100000. Hence,
0 0 0 0 1 0 1 1 0 . 1 1 0 0 0 0 0 0
+ 0 0 0 0 0 1 0 0 0 . 0 0 1 0 0 0 0 0
0 0 0 0 1 1 1 1 0 . 1 1 1 0 0 0 0 0
The 17th bit, 0, is used for sign extension and 7 zeros are added to
the left. 8 zeros are appended to the right to extend the fractional
part. Thus, we obtain the desired result, 30.625.
• Multiplication
Multiplication of two N-bit numbers gives a 2N-bit result. Consequently, the
issue of sign extension does not arise in multiplication. Multiplication of two
16:16 numbers will result in a 32:32 number. In my examples, I consider
numbers in the 4:4 format. A couple of examples illustrate the multiplication
operation:
1. 1.25 ∗ 3.25 = 4.0625
0001.0100 ∗ 0011.0100 = 00000100.00010000
2. 7.9375 ∗ 7.9375 = 63.00390625
0111.1111 ∗ 0111.1111 = 00111111.00000001
If either multiplicand is a negative number, we first take the two's complement
of that number and then perform the usual multiplication as explained above.
The sign of the result depends on how many two's complements were taken
before performing the multiplication. In hardware, this is determined by the
exclusive-OR of the two sign bits.
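The worked examples above can be reproduced by operating directly on 16-bit words the way the hardware does. This is a behavioral Python sketch; the helper names, and the use of Python's arbitrary-precision integers masked to word width, are assumptions of the example.

```python
# Two's complement 8:8 arithmetic on raw 16-bit words.

MASK16 = 0xFFFF

def sext(word, bits=16):
    """Interpret an unsigned word as a signed two's complement value."""
    return word - (1 << bits) if word & (1 << (bits - 1)) else word

def q88_add(a, b):
    """16-bit 8:8 addition; masking models discarding the carry-out."""
    return (a + b) & MASK16

def q88_sub(a, b):
    """Subtraction = addition of the two's complement of b."""
    return (a + ((~b + 1) & MASK16)) & MASK16

def q88_mul(a, b):
    """Signed 8:8 x 8:8 multiply giving a 32-bit 16:16 product."""
    return (sext(a) * sext(b)) & 0xFFFFFFFF
```

For example, q88_add(0x16C0, 0xF7E0) yields 0x0EA0, i.e. 22.75 + (-8.125) = 14.625 as in addition example 1 above.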
Addition, subtraction and multiplication are three very important operations
performed by the I-ALU, and the functionality of all three has been described in
detail in this section. This study goes a long way in determining the hardware
architecture of the system. Besides these operations, the multiply-accumulate
forms the heart of any DSP processor; special emphasis is laid on it in the
following sections. Division by numbers other than powers of 2 occurs rarely in
DSP related applications, so division is implemented only by the shift operation,
given the time and area cost of a full divider. Section 3.2 addresses the important
issue of rounding.
3.2 Outward/Directed Rounding
For most systems, although the internal buses of an ALU may be wide, the
registers are of fixed size. The input to this system is 16 bits wide, while the
internal bus is 32 bits wide. The word length of the outputs of the functional
units must therefore be reduced before they can be stored in these smaller
registers. This reduction in word length is achieved by the rounding operation:
the bits of lower significance of the output are discarded according to the
rounding direction of the operand. This introduces errors called precision
rounding errors. However, interval arithmetic ensures that the exact result of
the operation lies within the output interval. Provision is made in this system to
round the output interval values from 32 bits to either 24 bits or 16 bits,
depending on an input signal. This provision keeps the design flexible for
different applications.
As discussed earlier, the proposed system is based on two's complement
fixed-point number representation, which greatly simplifies the rounding
algorithm: a single procedure rounds up both positive and negative numbers,
and a single, different procedure rounds down both positive and negative
numbers. The IEEE standard defines four rounding modes, viz. round to
nearest, round towards zero, round towards positive infinity and round towards
negative infinity. For interval arithmetic, we are concerned with two of these:
rounding towards positive infinity and rounding towards negative infinity. The
algorithms for both are explained below with suitable examples.
• Rounding towards negative infinity.
Rounding towards negative infinity refers to denoting a high precision number
by the greatest machine representable number of low precision but smaller in
value. In fixed-point two’s complement representation, this is achieved by
simply discarding the bits of lower significance. This algorithm holds true for
positive and negative numbers as illustrated by the following examples:
– Positive numbers
6.78125 in the 8:8 format is represented as 00000110.11001000.
In the 8:4 format, it is represented as 00000110.1100, which is 6.75.
– Negative numbers
-6.78125 in the 8:8 format is represented as 11111001.00111000.
In the 8:4 format, it is represented as 11111001.0011, which is -6.8125.
• Rounding towards positive infinity.
Rounding towards positive infinity refers to denoting a high precision number
by the smallest machine representable number of low precision but greater
in value. In fixed-point two’s complement representation, this is achieved by
performing a logical ‘OR’ on the bits of lower significance to be discarded and
then adding this bit to the number to be retained. Once again, this algorithm
holds true for positive and negative numbers as illustrated by the following
examples:
– Positive numbers
6.78125 in the 8:8 format is represented as 00000110.11001000.
In the 8:4 format, it is represented as 00000110.1101, which is 6.8125.
– Negative numbers
-6.78125 in the 8:8 format is represented as 11111001.00111000.
In the 8:4 format, it is represented as 11111001.0100, which is -6.75.
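The two directed-rounding algorithms above can be sketched as bit operations on the raw fixed-point words. The function names are illustrative; drop is the number of low-order bits discarded (4 in the 8:8 to 8:4 examples above).

```python
# Directed rounding on two's complement fixed-point words.

def round_down(word, drop):
    """Toward -infinity: simply discard the `drop` low-order bits."""
    return word >> drop

def round_up(word, drop):
    """Toward +infinity: OR the discarded bits into a sticky bit, then add it."""
    sticky = 1 if word & ((1 << drop) - 1) else 0
    return (word >> drop) + sticky
```

With word = 0x06C8 (6.78125 in 8:8), round_down gives 0x6C (6.75 in 8:4) and round_up gives 0x6D (6.8125), matching the examples above; the same code handles the negative patterns unchanged.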
The above procedure rounds 32 bit fixed-point numbers in the 16:16 format to
24 bit fixed-point numbers in the 16:8 format; the same procedure is followed if
the output has to be reduced from 32 bits to 16 bits. These examples cover all
aspects of the rounding algorithm. It is called the “Outward Rounding” or
“Directed Rounding” algorithm and is responsible for the validity of the results
provided by interval analysis. The study of this algorithm makes it simple to
design the hardware that implements outward rounding. The proposed design
includes a separate rounding unit which takes inputs from the functional units
and provides the outputs of the system.
After getting acquainted with the design specifications of the ALU, I now proceed
to give a comprehensive description of its architecture. A few different architectures
have been explored to arrive at an optimum solution.
Chapter 4
Hardware Architecture
This chapter of the thesis contains a description of all the modules that constitute
the Interval-ALU. It gives details of the logic design at the gate level for the whole
system, one module at a time. The hardware model at the RTL level of abstraction
is built from these logic designs. Since throughput is the main performance metric
to be optimized, the logic is designed with reduction of the critical path delay in
mind. Several pipelined versions of the design are built along with the basic non-
pipelined one to improve the throughput. I begin with the top level block diagram
of the design and then go into the details of each module.
4.1 Overall Architecture
The overall architecture of the ALU can be seen in the block diagram shown in
Figure 4.1. The hardware model is divided into four parts, viz. the flag generator,
lower bound and upper bound modules, and the rounding unit. The flag generator
module is responsible for generating the control signals for the more complicated
multiplication operation. As the name suggests, the lower bound module and the
upper bound module calculate the lower and upper bounds of the output interval,
respectively. These two modules are independent of each other and hence operate in
parallel. The rounding unit implements the Outward Rounding algorithm explained
earlier.
Figure 4.1: Top Level Block Diagram of the ALU
The ALU is designed for operation on 16 bit input interval numbers in the two’s
complement form. The ALU has an input line that allows selection of the multiply-
accumulate mode, acc select. The ALU operates in the accumulate mode as long
as this line is held high. Another input line, rctl, determines the number of output
bits. The output is rounded to 24 bits when this line is held high and 16 bits when
this line is held low. Table 4.1 lists all the inputs to the ALU.

Table 4.1: Description of ALU Inputs
Input Description Bit Width
xL Lower bound on left-hand operand 16 bits
xU Upper bound on left-hand operand 16 bits
yL Lower bound on right-hand operand 16 bits
yU Upper bound on right-hand operand 16 bits
command Mathematical operation to be performed 3 bits
acc select Perform MAC when asserted 1 bit
rctl Width of output results (16 or 24 bits) 1 bit

The ALU has two 24-bit output lines that represent the lower and upper bounds
of the resulting interval. Besides this, there are output lines to flag the special
case in multiplication, the union of two disjoint sets and the intersection of
two disjoint sets. A further explanation of these output lines is provided in the
following sections. Table 4.2 lists all the outputs of the ALU.
Table 4.2: Description of ALU outputs
Output Description Bit Width
zL Lower bound on result 24 bits
zU Upper bound on result 24 bits
next Valid results on output lines 1 bit
union Union of disjoint sets 1 bit
empty Intersection of disjoint sets 1 bit
4.2 Flag Generator Module
A major significance of the design is the reduction in the number of
multiplications performed to evaluate the result of an interval multiplication
operation. The flag generator module forms the control logic for the
multiplication operation. It generates the necessary control signals to determine
one of the nine cases in multiplication. Based on the values of the input
operands, it generates a 4 bit mul flag
which selects among the nine cases. Table 4.3 shows the case to be selected based
on the value on the mul flag.
Table 4.3: mul flag for the Multiplication operation
mul flag Case Result
0001 xL ≥ 0; yL ≥ 0 [xLyL,xUyU ]
0010 xL ≥ 0; yL < 0 < yU [xUyL,xUyU ]
0011 xL ≥ 0; yU ≤ 0 [xUyL,xLyU ]
0100 xL < 0 < xU ; yL ≥ 0 [xLyU ,xUyU ]
0101 xL < 0 < xU ; yU ≤ 0 [xUyL,xLyL]
0110 xU ≤ 0; yL ≥ 0 [xLyU ,xUyL]
0111 xU ≤ 0; yL < 0 < yU [xLyU ,xLyL]
1000 xU ≤ 0; yU ≤ 0 [xUyU ,xLyL]
0000 xL < 0 < xU ;yL < 0 < yU [min(xUyL,xLyU),
max(xLyL,xUyU)]
The logic behind the generation of this flag is shown in Figure 4.2. Table 4.4
explains the generation of the various flag signals used in Figure 4.2. As shown in
the table, the flag signals are generated based on the values of the input operands.
These flags are used to obtain the mul signal based on which the inputs to the
multipliers in the main functional units are selected. This reduces the number of
multiplications to be performed.
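The flag logic of Table 4.3 and Table 4.4 can be modeled behaviorally as follows. This is an illustrative Python sketch; the signal names follow the tables, but the function itself is not the thesis RTL.

```python
# Model of the flag generator: sign tests on the operand bounds
# (Table 4.4) select one of the nine multiplication cases (Table 4.3).

def mul_flag(xL, xU, yL, yU):
    """Return the 4-bit case selector; 0b0000 marks the special case 9."""
    f1, f2 = xL >= 0, yL >= 0          # flag 1, flag 2
    f3, f4 = xU <= 0, yU <= 0          # flag 3, flag 4
    f5 = xL < 0 < xU                   # flag 5: x straddles zero
    f6 = yL < 0 < yU                   # flag 6: y straddles zero
    if f1 and f2: return 0b0001
    if f1 and f6: return 0b0010
    if f1 and f4: return 0b0011
    if f5 and f2: return 0b0100
    if f5 and f4: return 0b0101
    if f3 and f2: return 0b0110
    if f3 and f6: return 0b0111
    if f3 and f4: return 0b1000
    return 0b0000                      # case 9: both intervals contain zero
```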
Table 4.4: Flag Generation
Condition Flag Generated
xL ≥ 0 flag 1
yL ≥ 0 flag 2
xU ≤ 0 flag 3
yU ≤ 0 flag 4
(xL < 0)&&(xU > 0) flag 5
(yL < 0)&&(yU > 0) flag 6

Figure 4.2: Flag Generation Module

4.3 Lower Bound and Upper Bound Modules
The lower bound and the upper bound modules have very similar hardware
structures. The primary difference is the inputs that drive their functional units.
These modules form the heart of the ALU, as most of the logic involved in the
design is concentrated in them. Both modules are independent of each other's
operation, which makes the working of the ALU dependent only on the input
signals and not on its internal hardware. Each module is characterized by
dedicated hardware to
perform the accumulate operation. This is an important feature of the ALU because
the dot product, which is implemented through the multiply-accumulate operation,
is a core requirement of any DSP processor. Figure 4.3 and Figure 4.4 display the
basic block diagram of each of the two modules. As seen in the block diagrams, the
two modules are very similar in architecture. However, it is important to note the
different status lines generated by different portions of the modules.
Figure 4.3: Block Diagram of Lower Bound Module
Figure 4.4: Block Diagram of Upper Bound Module
In a non-pipelined architecture, the circuit performance is determined by these
modules of the design. The critical path of a design may be defined as the single
slowest feasible path contained in the design. The greater the logic depth, the
longer the critical path, the lower the frequency at which the circuit can operate,
and thus the lower the throughput. Since these modules form the critical path,
significant effort must be put into optimizing them. Pipelining provides the ideal
solution to the throughput problem and is discussed in detail beginning in section
4.5. I will now go through each of the individual blocks that make up the ALU as
shown in Figure 4.1.
4.3.1 Functional Units and Control Logic
The combinational logic required to perform arithmetic operations on interval
numbers is located in the functional unit block. Apart from the difference in a few
output status lines, the hardware of the functional units in the two modules is
identical. Each module has an adder/subtractor, a multiplier and other
combinational logic to implement the set operations. Figure 4.5 shows the
functional unit in the lower bound module. It is important to note that the inputs
to the adder are the two lower bounds of the input operands, while the inputs to
the subtractor are the lower bound of the first input operand and the upper bound
of the second. The type of operation and the mul signal are used as controls to
determine the outputs of the functional units. The status line empty is generated
by this portion of the design to indicate the intersection of two disjoint sets.
Figure 4.6 shows the functional unit in the upper bound module. The inputs
to the adder and subtractor are different for this module than the lower bound
module. The upper bounds of both the input operands drive the adder, while the
upper bound of the first input operand and the lower bound of the second input
operand drive the subtractor. Once again, the mul flag determines the inputs to the
multiplier and the outputs of the union and intersection set operations. The status
line union is generated by this module which indicates the union of two disjoint
sets.
The outputs of the arithmetic units are given to the special case multiplication
block. The next section describes the details of the special case multiplication block.
Figure 4.5: Lower Bound Module
4.3.2 Special Case Multiplication Block
The situation in which both input operands include zero in their intervals
represents a special case, referred to as the ‘Special Case Multiplication’. In
contrast to the normal cases, where interval multiplication requires only two
multiplications, this special case also requires a comparison between two products
for each of the bounds. Hence a memory element is needed to store one product
and make it available for comparison with the next. It requires two
multiplications and one comparison to obtain each
the two bounds. The hardware to determine the lower bound and the upper bound
is identical and is repeated in both the modules. A status line next is taken from this
block in the lower bound module to indicate the special case. Figure 4.7 shows the
hardware architecture of this block. As seen in the diagram, the left/right out c
line coming in from the functional unit block is stored as left/right out r and used
for comparison. The result of this comparison is used only for the interval multi-
plication operation when the mul flag is 0000. The minimum value is selected for
the lower bound module and the maximum value is selected for the upper bound
module. The special case multiplication may lead to synchronization problems if
not dealt with properly.
Figure 4.7: Special Case Multiplication
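A behavioral sketch of this special-case sequencing: for each bound the first product is registered, the second is computed, and a comparison selects the minimum (lower bound module) or maximum (upper bound module). The names are illustrative, not the thesis signal names, and the two register-then-compare steps here stand in for the two hardware cycles.

```python
# Special-case multiplication (mul flag == 0000): two multiplications and
# one comparison per bound, with one product held in a register.

def special_case_bounds(xL, xU, yL, yU):
    out_r = xU * yL               # step 1: first product stored in the register
    out_c = xL * yU               # step 2: second product arrives
    lower = min(out_r, out_c)     # lower bound module keeps the minimum
    out_r = xL * yL
    out_c = xU * yU
    upper = max(out_r, out_c)     # upper bound module keeps the maximum
    return lower, upper
```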
4.3.3 Multiply-Accumulate Block
A multiply-accumulator forms a very important part of the ALU, all the more
because it is intended for use in DSP applications. DSP applications are
characterized by repetitive multiply-add operations that are computed by means
of a dot product.
Mathematically, the dot product can be calculated as:
a · b = ∑ aibi (summing over i = 0 to n)
This operation can be readily performed by a multiply accumulate block. Figure
4.8 shows the hardware architecture of the multiply-accumulate block. As we can
see in the figure, it consists of an adder and a memory element which acts as the
accumulator. An input line acc select determines whether the output needs to be
accumulated or not. When high, the block is in accumulate mode. The output of
this block is 32 bits long and is given to the rounding unit, where outward rounding
is implemented.
Figure 4.8: Multiply-Accumulate Module
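Behaviorally, holding acc select high while streaming coefficient pairs through the MAC block computes the dot product, as the following sketch (in Python, with illustrative names) shows:

```python
# Dot product via repeated multiply-accumulate, one MAC per pair.

def dot_product(a, b):
    acc = 0                      # the accumulator register starts cleared
    for ai, bi in zip(a, b):
        acc += ai * bi           # one multiply-accumulate per coefficient pair
    return acc
```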
4.4 Rounding Unit
The rounding unit forms a critical part of the Interval-ALU. This unit performs
the outward rounding, which guarantees the result of a computation to lie within
the output interval. The proposed design has provision to round a 32 bit output
from the previous block to a 24 bit or a 16 bit word depending on the application
for which it is going to be used. An input line, rctl, determines the number of bits to
which the output has to be rounded. When this line is high, the output is rounded
to 24 bits, else it is rounded to 16 bits. The outward rounding algorithm has been
discussed in section 3.2, hence this section concentrates on describing the hardware
to implement these rounding modes. Figure 4.9 shows the architecture for rounding
the output of the lower bound module.
Figure 4.9: Lower Bound Rounding
From the figure we can see that 8 or 16 bits of lower significance are discarded
based on the rctl input, and the higher 24 bits or 16 bits are retained.
The rounding operation for the upper bound module is slightly more complicated
as compared to the lower bound module. In this case, the bits of lower significance
are not simply discarded, but are logically ‘OR’ed and the resultant bit is added to
the bits to be retained. If the rctl line is high, the last 8 bits are logically ‘OR’ed
and added to the remaining 24 bits. On the other hand, if the rctl line is low, the
last 16 bits are logically ‘OR’ed and added to the 16 bits of high significance. Figure
4.10 shows the architecture for rounding the output of the upper bound module.
Figure 4.10: Upper Bound Rounding
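Putting the two rounding directions together with the rctl control gives a behavioral sketch of the rounding unit; the function name and the use of unsigned Python integers are assumptions of this example.

```python
# Rounding unit model: a 32-bit result is reduced to 24 bits (rctl high)
# or 16 bits (rctl low); the lower bound truncates (round toward -inf),
# the upper bound adds a sticky OR of the discarded bits (round toward +inf).

def round_unit(lower32, upper32, rctl):
    drop = 8 if rctl else 16                      # rctl=1: keep 24 bits
    z_lo = lower32 >> drop                        # discard low bits: round down
    sticky = 1 if upper32 & ((1 << drop) - 1) else 0
    z_hi = (upper32 >> drop) + sticky             # add sticky bit: round up
    return z_lo, z_hi
```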
This completes the architecture of the entire ALU. From the architecture, it is
safe to say that the lower bound module and the upper bound module are the
critical modules in the design. Most of the logic is concentrated in them, and
hence it is important to optimize them to a high degree. The following section
concentrates on optimizing these critical modules to improve the throughput of
the system by pipelining the design. It provides details of several pipelined
versions of the I-ALU.
4.5 Pipeline Architecture of the Design
This section presents several versions of the I-ALU, pipelined to various degrees to achieve maximum throughput. Pipelining is a technique used to reduce the critical path of a circuit and hence increase the speed at which it can operate. The increase in throughput comes at the cost of increased area, increased power dissipation and increased initial latency. Although pipelining may appear to have more disadvantages than advantages, in DSP systems, where throughput is of prime importance, it goes a long way toward improving the efficiency of the system. As seen earlier, DSP workloads are dominated by multiplication and addition instructions. A multiplier built from combinational logic alone has a very high logic depth and is one of the main candidates for the critical path. Hence, to reduce this logic depth, it is necessary to pipeline the multiplier. However, the benefit of pipelining the multiplier saturates at a certain depth, beyond which the rest of the design must be pipelined to improve efficiency further. The following sections discuss these pipelining techniques in more detail.
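The throughput/latency trade-off described above can be illustrated with a small model. A hedged Python sketch (the stage delays below are made-up numbers, not synthesis results): the clock period of a pipelined circuit is set by its slowest stage, and a run of n operations pays the pipeline-fill latency only once.

```python
def pipeline_stats(stage_delays_ns, n_ops):
    """Return (clock period, total time) for n_ops back-to-back operations.

    The period is the slowest stage's delay; after the pipeline fills
    (len(stage_delays_ns) cycles), one result completes per cycle."""
    period = max(stage_delays_ns)
    cycles = len(stage_delays_ns) + n_ops - 1
    return period, cycles * period

# One monolithic 16 ns stage vs. the same logic split into four 4 ns stages:
flat = pipeline_stats([16.0], 1000)       # (16.0, 16000.0)
split = pipeline_stats([4.0] * 4, 1000)   # (4.0, 4012.0)
```

For long runs the pipelined version approaches a 4x throughput gain, while paying only a three-cycle extra initial latency.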
4.5.1 Need for Pipelining
In the proposed I-ALU design, synthesis results have shown that without any
level of pipelining, the lower bound module and the upper bound module form the
critical path in the design. Figure 4.11 shows the critical path in these modules.
The diagram shows that the critical path traverses some control logic, a multiplier, some more combinational logic and finally an adder. This long path forces the design to operate at lower clock frequencies. The bulk of the logic in the critical path belongs to the multiplier, so pipelining this portion of the logic yields a significant decrease in the clock period. The implementation of several pipelined multiplier architectures is reported in the following section.
Figure 4.11: Critical Path
4.5.2 Partially Pipelined Design
A partially pipelined architecture replaces the purely combinational multiplier with a pipelined multiplier architecture. The design tool Synopsys provides several pipelined multiplier architectures in its library that can be used [33]. A significant increase in circuit performance is observed from the use of these DesignWare IP blocks. However, this improvement comes at the cost of increased area and power dissipation, so a suitable trade-off must be made to choose the best design. Figure 4.12 shows the architecture of a non-pipelined multiplier, while Figures 4.13, 4.14, 4.15 and 4.16 show abstract architectures of two-, three-, four- and five-stage pipelined multipliers, respectively.
As seen in Figure 4.12, the cloud of combinational logic in a non-pipelined multiplier is large. Pipelining introduces registers into this combinational logic, thereby reducing the critical path.
Figure 4.12: Non-Pipelined Multiplier Architecture
Figure 4.13: Two-stage Pipelined Multiplier Architecture
Figure 4.14: Three-stage Pipelined Multiplier Architecture
The subsequent figures show that as the number of pipeline stages increases, the cloud of logic between two consecutive registers shrinks, so each pipelined multiplier operates at a faster clock than the previous one. However, once such a pipelined multiplier is included as part of the circuit, beyond a certain level of pipelining the multiplier ceases to be part of the critical path. Instead, the other combinational logic, involving adders and multiplexers, forms
the critical path.
Figure 4.15: Four-stage Pipelined Multiplier Architecture
Figure 4.16: Five-stage Pipelined Multiplier Architecture
Thus, a pipelined multiplier contributes significantly to improving the throughput of a system, but further performance gains require pipelining the rest of the design, as discussed in the following section.
4.5.3 Highly Pipelined Design
A highly pipelined design combines the pipelined multiplier that gives the best results for this design with additional pipeline stages in the surrounding logic. As the results in Chapter 5 indicate, the performance of the ALU saturates for pipelined multipliers with more than three stages: all architectures employing deeper multipliers operate at approximately the same clock frequency. This is because, beyond three stages, the multiplier ceases to be part of the critical path; the maximum logic depth is instead formed by other combinational logic involving multiplexers and adders, as shown in Figure 4.11. For further performance enhancement it is necessary to reduce this cloud of combinational logic. Improvement in circuit performance is therefore achieved by splitting this cloud into smaller clouds, apart
from using a three stage pipelined multiplier. Figure 4.17 shows the logic diagram
for this highly optimized design.
Figure 4.17: Highly Pipelined Architecture
The figure shows that several registers are inserted in the critical path to reduce its length. The combinational logic between any two consecutive registers is reduced to some control logic or simply an adder. This produces a significant decrease in the overall clock period of the circuit, as the synthesis results will show. Among the combinational blocks used in this design, the multipliers contain the most logic, followed by the adders and then the control logic. It is therefore important to introduce a register in the path prior to the adder: this brings the clock period down from 17.73 ns to 3.55 ns. Introducing two more pipeline stages in the control logic further decreases the clock period to 3.25 ns, as shall be seen in the following section. This improvement in performance from 3.55 ns to 3.25 ns, a decrease of roughly 8.5%, comes at a 4.97% increase in the area of the design.
In conclusion, this chapter forms the heart of the work presented in this thesis. It provides a detailed explanation of the architecture of the overall design: every module and its functionality has been explained comprehensively, and one of the most important circuit optimizations, pipelining, has been discussed. Along the way, the role played by the multipliers in governing the performance of the ALU has been highlighted, and a highly pipelined architecture employing a three-stage multiplier will be shown, on the basis of synthesis results, to be the best design. The next chapter provides statistical results obtained from various simulation and synthesis runs; throughput, area and power dissipation have been obtained for all previously described architectures.
Chapter 5
Testing and Results
This chapter compares the timing, area and power dissipation of the different I-ALU architectures based on statistical data obtained from simulation and synthesis. Given its significance for DSP applications, throughput is treated as the prime performance metric when analyzing the various designs and arriving at an optimum solution. Efforts to improve the throughput of the system have an adverse effect on the area, latency and power dissipation of each design. Tables and graphs are used to compare these metrics across the various architectures.
The functionality of the design was verified by running simulations in the Cadence
environment. Verilog HDL was used to capture the behavior of the ALU while
Synopsys was used for synthesis purposes. The 0.18 µm ‘OSU Standard Cell Library’
was used while synthesizing the various modules. Synopsys Design Compiler [34]
was used for timing analysis and Synopsys PrimePower [35] was used to obtain the
power dissipation results.
5.1 Simulation Results
The operation of the design was verified by running simulations for 100% code coverage. All possible input combinations were considered and the simulation results were compared with the expected values. Three cases deserve special note: the union of two disjoint sets requires two clock cycles, the intersection of two disjoint sets results in a null set, and the special-case multiplication requires two clock cycles, as opposed to the single clock cycle required for all other operations. Figure 5.1 shows the results for the addition, subtraction and multiplication operations. As seen in the figure, the results of addition, subtraction and multiplication were each obtained after one clock cycle, i.e. the design has a latency of 1. For the special-case multiplication, however, the output was obtained only after two clock cycles; the status line then goes high to indicate the occurrence of this case, signaling that the actual output is available on the following clock cycle rather than the current one. Also, as soon as the acc select line goes high, the ALU enters accumulate mode, as can be seen from the simulation results.
Figure 5.2 shows the simulation results for the interval union and intersection
operations. Apart from the usual behavior of union and intersection, the two special
cases are worth noting. The union of two disjoint sets results in two output intervals
on two consecutive clock cycles, and this is appropriately indicated by the union
status line. For the case of intersection of two disjoint sets, the empty status line
goes high indicating a null set, in which case the interval output is considered invalid.
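The union and intersection behavior just described can be sketched in Python. This is a software model of the simulated semantics only, not of the hardware itself: the Boolean flag models the empty status line, and a two-interval result models the two consecutive output cycles.

```python
def interval_intersect(a, b):
    """Intersection of closed intervals a=(lo, hi), b=(lo, hi).
    Returns (result, empty): 'empty' mirrors the empty status line."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        return None, True              # disjoint operands: null set
    return (lo, hi), False

def interval_union(a, b):
    """Union of two intervals. Disjoint operands yield two output
    intervals, which the hardware emits on consecutive clock cycles."""
    if a[1] < b[0] or b[1] < a[0]:
        return sorted([a, b])          # two disjoint pieces
    return [(min(a[0], b[0]), max(a[1], b[1]))]
```

For example, intersecting (0, 2) with (5, 7) returns the empty flag, while their union returns both operand intervals.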
5.2 Synthesis Results
The different architectures described in the previous chapter have been synthe-
sized. From the synthesis results, a comparative analysis in terms of throughput,
Figure 5.1: Simulation Results for Add, Subtract and Multiply
Figure 5.2: Simulation Results for Interval Union and Intersection
area and power dissipation has been performed and is presented in this section as follows. Sections 5.2.1 to 5.2.3 contain the timing and area results for the various architectures: Section 5.2.1 gives the results for the most primitive design, the non-pipelined version; Section 5.2.2 gives comparative results for architectures using pipelined multipliers of increasing depth; and Section 5.2.3 presents the results for the highly pipelined architecture, which is the most throughput-optimized design among those discussed. While performing these tests on the overall designs, each individual lower-level module, viz. the flag generator, the lower bound module, the upper bound module and the rounding unit, was optimized for best performance. These optimized lower-level modules were then used to obtain the results for the top-level module, ensuring that the results obtained are the best possible. Section 5.3 gives the power dissipation results.
5.2.1 Non-Pipelined Architecture
The timing results of the optimized modules for the non-pipelined architecture, as reported by Synopsys, are tabulated in Table 5.1. A graphical representation of these results, shown in Figure 5.3, gives a better view of the distribution of logic across the various modules. The obtained results are consistent with expectations.
Table 5.1: Timing Reports for the Non-Pipelined Architecture
No. Module Name Timing Report
1 Flag Generator Module 3.30 ns
2 Lower Bound Module 17.79 ns
3 Upper Bound Module 17.29 ns
4 Lower Bound Rounding 2.71 ns
5 Upper Bound Rounding 3.42 ns
6 Overall Architecture 17.73 ns
As specified in the previous section, the critical path of the design lies in either the lower bound or the upper bound module, as these results aptly show. The two modules have the highest concentration of combinational logic and hence determine the timing of the overall circuit. The other modules contain fewer combinational blocks and can therefore operate at higher speeds. The values listed in the table are as reported by Synopsys. The overall design achieves a lower clock period than the slowest lower-level modules because the tool re-optimizes the timing of all the modules while synthesizing the top-level module. Figure 5.3 is a graphical representation of these results.
Figure 5.3: Timing Reports of Non-Pipelined Architecture
Table 5.2 gives the area reports of all the lower level and upper level modules.
Figure 5.4 is a graphical representation of these results. As expected, most of the
area of the design is formed by the lower and upper bound modules. The area of
the overall design is approximately the sum of the areas of the individual modules.
Table 5.2: Area Reports of Non-Pipelined Architecture
No. Module Name Area Report
1 Flag Generator Module 8,964 µm2
2 Lower Bound Module 116,502 µm2
3 Upper Bound Module 110,855 µm2
4 Lower Bound Rounding 6,096 µm2
5 Upper Bound Rounding 11,059 µm2
6 Overall Architecture 253,476 µm2
From the results in Table 5.1, the minimum clock period required for the operation of the circuit is 17.73 ns; thus, the maximum frequency at which it can operate is 56.401 MHz. This is a considerably low frequency of operation and hence represents a design with low throughput. Design effort therefore needs to be directed toward improving the throughput of the non-pipelined design, which is achieved with a pipelined architecture. The following sections show the results for the various pipelined architectures. It is important to note that the area of this design is 253,476 µm2 and that the efforts to improve the circuit will have a direct impact on it: the area of the design increases as the number of pipeline stages increases.
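The frequency figure quoted above is simply the reciprocal of the minimum clock period; a one-line Python check:

```python
def max_frequency_mhz(period_ns):
    """Maximum operating frequency (MHz) for a minimum clock period (ns)."""
    return 1000.0 / period_ns

f = max_frequency_mhz(17.73)   # about 56.4 MHz, as reported
```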
5.2.2 Design with Pipeline Multipliers
As described in the previous chapter, the purpose of using pipelined multipliers is to reduce the critical path of the design. The critical path shortens as the number of multiplier stages increases, as is evident from the results tabulated in Table 5.3, which shows the minimum clock period for the various pipelined multipliers. Since the multiplier is part of the critical path of the entire design, the use of pipelined multipliers directly increases the
Figure 5.4: Area Reports for the Non-Pipelined Architecture
frequency at which the circuit can operate.
Increasing the number of stages in the pipelined multipliers certainly improves throughput. The multiplier is present in the lower bound module and the upper bound module, and synthesis runs on these modules clearly show the decrease in their minimum clock period. The flag generator module, the lower bound rounding module and the upper bound rounding module are unaffected by the multiplier's pipeline depth. Table 5.4 shows the timing reports for the two significant modules and the effect of the pipelined multipliers on the overall design.
Table 5.3: Timing Reports for the Pipelined Multipliers
Pipeline Multiplier Minimum Clock Period
Two Stage 6.52 ns
Three Stage 3.38 ns
Four Stage 2.38 ns
Five Stage 2.38 ns
Table 5.4: Timing Reports for various Pipelined Architectures
Pipeline Stage Lower Module Upper Module Overall Design
Non-Pipelined 17.79 ns 17.29 ns 17.73 ns
Two Stage 7.75 ns 7.51 ns 7.75 ns
Three Stage 4.75 ns 4.67 ns 4.75 ns
Four Stage 4.67 ns 4.57 ns 4.67 ns
Five Stage 4.78 ns 4.60 ns 4.78 ns
Figure 5.5 plots the minimum clock period of the different overall designs. From the graph, a significant improvement in timing is observed up to the three-stage pipelined multiplier: the clock period is cut by more than half in going from the non-pipelined design to a two-stage pipelined multiplier, and a further significant decrease is observed with a three-stage pipelined multiplier. However, increasing the pipeline depth beyond three stages does not change the timing of the overall design significantly, because the multiplier then ceases to be part of the critical path, which is instead formed by other combinational logic in the design.
The area of the design grows with the number of pipeline stages. Hence, following the law of diminishing returns, the design with a three-stage pipelined multiplier is the best among those described. Table 5.5 gives the area reports of the lower bound module, upper bound module and the overall architecture for these designs, and Figure 5.6 plots the areas of the different architectures.
Table 5.5: Area Reports for various Pipelined Architectures
Pipeline Stage Lower Module Upper Module Overall Design
Non-Pipelined 116,502 µm2 110,855 µm2 253,476 µm2
Two Stage 172,918 µm2 173,370 µm2 372,407 µm2
Three Stage 211,968 µm2 212,029 µm2 450,116 µm2
Four Stage 248,847 µm2 246,860 µm2 521,826 µm2
Five Stage 278,246 µm2 278,897 µm2 583,262 µm2
From the timing and area analysis, it is clear that a trade-off between the two performance metrics, delay and area, is needed to determine the better design. From the timing plot, we conclude that the improvement
Figure 5.5: Timing Reports of Different Pipelined Architectures
in performance saturates beyond a three-stage pipelined multiplier, while every additional stage significantly increases the area of the design. Figure 5.7 plots area and timing on the same graph for the various designs; from this plot, we can firmly say that a three-stage pipelined multiplier is the best choice for the proposed design.
Figure 5.6: Area Reports of Different Pipelined Architectures
5.2.3 Highly Pipelined Design
The architecture of the highly pipelined design was shown in the previous chapter: it uses a three-stage pipelined multiplier, and the remaining combinational logic is pipelined as deeply as possible to achieve maximum performance. Table 5.6 compares the timing of the highly pipelined design to the non-pipelined design; a significant improvement is observed in the main performance metric. Figure 5.8 is a graphical representation of the values in Table 5.6. An improvement on the order of 500% in the performance of the pipelined architecture over the non-pipelined architecture is observed. Considering the importance of throughput for DSP applications, this is a favorable trade-off when compared to the increase in the area of the design. Table 5.7 shows the comparison between the
Figure 5.7: Timing and Area Reports of Different Pipelined Architectures
frequency of operation, the upper bound on throughput and the initial latency of the non-pipelined and highly pipelined designs.
Finally, Table 5.8 compares the areas of the individual modules in the non-pipelined and highly pipelined designs. There is a considerable increase in area for the highly pipelined design; however, since area is a secondary performance metric for this design, the improvement in throughput outweighs the disadvantage of the increased area. Figure 5.9 is a graphical representation of the values in Table 5.8.
Table 5.6: Timing Reports for Non-Pipelined and Highly-Pipelined Architectures
Module Name Non-Pipelined Highly Pipelined % Improvement
Flag Generator 3.30 ns 3.30 ns 0.00
Lower Bound 17.79 ns 3.25 ns 547.38
Upper Bound 17.29 ns 3.25 ns 532.00
Lower Round 2.71 ns 2.71 ns 0.00
Upper Round 3.42 ns 3.42 ns 0.00
Overall Design 17.73 ns 3.25 ns 545.54
Table 5.7: Results for Non-Pipelined and Highly-Pipelined Architectures
Design Frequency Throughput (ops/s) Latency (cycles)
Non-Pipelined 56.401 MHz 56 × 10^6 2
Highly-Pipelined 307.692 MHz 307 × 10^6 7
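The frequency, throughput and latency figures in Table 5.7 can be combined into a total-time estimate. A hedged Python sketch (assuming one result per cycle once the pipeline has filled; the 10,000-operation batch size is an arbitrary example, not from the thesis):

```python
def batch_time_us(freq_mhz, latency_cycles, n_ops):
    """Time in microseconds to complete n_ops operations: the first result
    appears after latency_cycles, then one result completes per cycle."""
    cycles = latency_cycles + n_ops - 1
    return cycles / freq_mhz

slow = batch_time_us(56.401, 2, 10_000)    # non-pipelined design
fast = batch_time_us(307.692, 7, 10_000)   # highly pipelined design
speedup = slow / fast                      # roughly 5.45x, matching the ~545% figure
```

The extra five cycles of initial latency in the highly pipelined design are negligible for any realistic batch size.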
Table 5.8: Area Reports for Non-Pipelined and Highly-Pipelined Architectures
Module Name Non-Pipelined Highly Pipelined % Increase in Area
Flag Generator 8964 µm2 8964 µm2 0.00
Lower Bound 116502 µm2 262349 µm2 125.18
Upper Bound 110855 µm2 260793 µm2 135.25
Lower Round 6096 µm2 6096 µm2 0.00
Upper Round 11059 µm2 11059 µm2 0.00
Overall Design 253476 µm2 548899 µm2 116.54
Figure 5.8: Timing Reports of Non-Pipelined and Highly-Pipelined Architectures
5.3 Power Analysis
According to Moore's Law, the number of transistors on a chip doubles approximately every 18 months. Advances in technology are putting more transistors on a wafer and increasing wafer sizes while the cost per wafer remains approximately the same. Amidst these improvements, the limiting factor has been the heating of chips due to excessive power dissipation. Logic design efforts can improve the throughput of a system, while newer technologies help reduce its overall area; in these circumstances, power dissipation becomes a critical performance metric.
Figure 5.9: Area Reports of Non-Pipelined and Highly-Pipelined Architectures
The power analysis was done using the Synopsys power tool PrimePower, a full-chip, dynamic power analysis tool that works at the gate level. Its high-capacity power analysis supports industry-standard synthesis libraries and provides the detail needed to meet power specifications while reducing packaging costs and excess power consumption. It is an improvement over Synopsys' earlier power tool, Design Power; one of its advantages is the ability to handle designs of up to 10 million gates. PrimePower models pattern-dependent capacitive-switching, short-circuit and static power consumption, and accounts for instance-specific cell-state dependencies, glitches, multiple loads and nonlinear ramp effects. To use PrimePower, the design was first synthesized to generate a netlist. Simulations were then run on this synthesized design with
more than 1000 input vectors. The resulting .vcd file contains information about the switching activity caused by the input vectors; PrimePower uses this information to estimate the power dissipation. PrimePower outputs text reports giving a breakdown of power dissipation, as well as graphical reports that can include histograms and waveforms.
This section provides the results of the PrimePower analysis performed on the different ALU architectures with varying numbers of input vectors.
5.3.1 Generating Input Vectors
The power tool flow was automated using the SSHAFT scripts. For the specified input vectors, the script determines the number of 1 → 0 and 0 → 1 transitions in order to approximate the power dissipation. It is therefore important to supply as many input vectors as possible to obtain an acceptable approximation. These input vectors were randomly generated using the MATLAB function rand; the following MATLAB code generates them:
close all;
clc;
ra = round((2^16 - 1) * (rand(1,500) - 0.5));
rb = ra + 2;
rc = round((2^16 - 1) * (rand(1,500) - 0.5));
rd = rc + 2;
The rand function generates 500 values uniformly distributed between 0 and 1 for each of ra and rc. These values are shifted to lie between -0.5 and 0.5 and scaled to a 16-bit representation. [ra, rb] and [rc, rd] form the two input intervals; the upper bounds are derived from the lower bounds as shown above, since each must be greater than (or equal to) its respective lower bound.
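An equivalent vector generator can be sketched in Python using only the standard library (the seed and helper name here are arbitrary, introduced only for illustration):

```python
import random

def gen_vectors(n=500, seed=1):
    """Generate n interval operand pairs [ra, rb] and [rc, rd], mirroring
    the MATLAB script: shift rand into (-0.5, 0.5), scale to 16 bits."""
    rng = random.Random(seed)
    ra = [round((2**16 - 1) * (rng.random() - 0.5)) for _ in range(n)]
    rc = [round((2**16 - 1) * (rng.random() - 0.5)) for _ in range(n)]
    rb = [x + 2 for x in ra]   # each upper bound exceeds its lower bound
    rd = [x + 2 for x in rc]
    return ra, rb, rc, rd
```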
Further processing was done on the input vectors so that the special-case multiplication and the union of two disjoint sets were exercised. The number of random vectors generated can be varied by simply changing the arguments of the rand function. The power dissipation of all designs was analyzed using 500, 1000 and 2000 vectors; the results are given in the following section.
5.3.2 Statistical Results from Power Scripts
The power dissipation is expected to increase with the number of pipeline stages, because the switching activity grows with the number of registers. Table 5.9 gives the power dissipated, in mW, for the different architectures with 500 input vectors, and Figure 5.10 is a graphical representation of these values. As the plot shows, power dissipation increases with the number of pipeline stages.
Table 5.9: Power Dissipation for Different Architectures with 500 Input Vectors
Pipeline Multiplier Power Dissipated (in mW)
Non-pipelined 2.655 × 10^-2
Two Stage 2.961 × 10^-2
Three Stage 3.443 × 10^-2
Four Stage 3.881 × 10^-2
Five Stage 4.327 × 10^-2
The power dissipation is also expected to increase with the number of input vectors. Table 5.10 shows the power dissipation of the architecture that employs a 3-stage pipelined multiplier, and Figure 5.11 is a graphical representation in which the expected trend can be seen.
Figure 5.10: Power Dissipation for 500 Input Vectors
Lastly, Table 5.11 gives the power dissipation values for all the architectures, and Figure 5.12 is a graphical representation of the same. The values in the table show that power dissipation increases with the number of pipeline stages for every input-vector count, a trend that is consistent throughout the table. The highly pipelined architecture, which employs a 3-stage pipelined multiplier, shows higher power dissipation than the partially pipelined architecture that uses the same multiplier.
Having done extensive analysis of the timing, area and power dissipation of the architectures, the ideal way to compare them and arrive at the best solution is a performance metric that combines
Table 5.10: Power Dissipation for 3-stage Pipelined Architecture
Number of Vectors Power Dissipation (in mW)
500 3.443 × 10^-2
1000 3.516 × 10^-2
2000 3.551 × 10^-2
Figure 5.11: Power Dissipation for Different Input Vectors for the 3-stage Pipelined Multiplier Design
the effect of all of the above. Throughput per unit power dissipation is one of the best ways to summarize the performance of an architecture; with power analysis gaining importance, this metric has become an industry
Table 5.11: Power Dissipation (in mW) for All Stages with Different Input Vectors
Pipeline Stages 500 Vectors 1000 Vectors 2000 Vectors
Non-pipelined 2.655 × 10^-2 2.768 × 10^-2 2.785 × 10^-2
Two Stage 2.961 × 10^-2 3.219 × 10^-2 3.287 × 10^-2
Three Stage 3.443 × 10^-2 3.516 × 10^-2 3.551 × 10^-2
Four Stage 3.881 × 10^-2 3.957 × 10^-2 3.994 × 10^-2
Five Stage 4.327 × 10^-2 4.402 × 10^-2 4.441 × 10^-2
Highly Pipelined 4.207 × 10^-2 4.312 × 10^-2 4.295 × 10^-2
standard [36]. Table 5.12 lists these values for all the architectures, and Figure 5.13 is a graphical representation of the same. From the figure, it is clear that the highly pipelined design is by far the best architecture of all.
Table 5.12: Throughput per unit Power Dissipation for All Architectures
Architecture Throughput per Power (operations/s/mW)
Non-pipelined 20.231 × 10^8
Two-stage 40.07 × 10^8
Three-stage 59.727 × 10^8
Four-stage 54.081 × 10^8
Five-stage 47.478 × 10^8
Highly Pipelined 71.196 × 10^8
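The metric in Table 5.12 can be approximately reproduced from the earlier tables: take the throughput upper bound (one operation per cycle at the maximum clock frequency) and divide by the power dissipation in milliwatts. A sketch using the non-pipelined design's numbers:

```python
def throughput_per_mw(freq_mhz, power_mw):
    """Operations per second (one per cycle) per milliwatt of dissipation."""
    return freq_mhz * 1e6 / power_mw

# Non-pipelined design: ~56.4 MHz (Table 5.7), ~2.785e-2 mW (Table 5.11)
metric = throughput_per_mw(56.401, 2.785e-2)   # on the order of 20 x 10^8
```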
In summary, this section has presented the results of simulation and synthesis runs performed on the various architectures. The functionality of the design was verified by running simulations for 100% code coverage. Synopsys Design Compiler and Synopsys PrimePower were the tools used for estimating the timing, area and power dissipation of these architectures, and these performance metrics were used to determine
Figure 5.12: Power Dissipation for Different Input Vectors for All Architectures
the best architecture for the design. A highly pipelined architecture with a 3-stage
pipelined multiplier was found to be the best design amongst those discussed. The
following chapter gives a summary of the work presented in the thesis along with
future work.
Figure 5.13: Throughput per unit Power Dissipation for All Architectures
Chapter 6
Conclusions and Future Work
6.1 Conclusions
This thesis has presented several architectures for an Arithmetic Logic Unit
(ALU) employing Interval Numbers (I-ALU). The ALU is dedicated for use in ap-
plications related to the DSP and Controls field. Interval arithmetic is one of the
solutions to problems that arise due to rounding of numbers on finite precision
machines. Implementation of Interval Arithmetic in software is a slow process.
Dedicated hardware that performs this arithmetic at speeds comparable to non-
interval ALUs are necessary for this purpose. Fixed-point ALUs have advantages
over floating-point ALUs in terms of cost and complexity of the design. In many
DSP applications, the dynamic range of floating point arithmetic is not required.
Therefore, a faster and lower power fixed point hardware would be a better choice.
The work presented in this thesis gives the basic architecture of an interval
ALU. This ALU operates on 16-bit interval numbers in two's-complement form.
The ALU performs the basic arithmetic operations of addition, subtraction and
multiplication, and it can also perform the set operations of union and
intersection on interval numbers. Since the multiply-accumulate operation forms
an integral part of any DSP system, dedicated hardware is included in this
design to perform it. Division is not inherently performed by the ALU;
division by degenerate intervals that are powers of two can be performed with
the shift operation. The proposed design produces results in one clock cycle,
except when the result is a union of two disjoint intervals or a product of
two intervals that both contain zero. Hardware support is provided to handle
these special cases so that correct results are produced and the overall
design remains uniform from a timing perspective. Errors that arise from
rounding results to a fixed number of bits are accounted for by implementing
the outward rounding algorithm in hardware. The ALU can round its result to
either 24 bits or 16 bits, depending on the application for which it is used.
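The operations above can be sketched in software as follows. This is an illustrative Python model of the arithmetic the I-ALU implements, not the thesis's RTL: plain integers stand in for the 16-bit fixed-point words, and all function names are ours. The shift-based division rounds outward, shifting the lower bound toward −∞ and the upper bound toward +∞.

```python
# Illustrative model of the I-ALU's interval operations (not the thesis RTL).
# An interval is a pair (lo, hi) of integers standing in for 16-bit
# two's-complement fixed-point words.

def iv_add(a, b):
    # [a_lo + b_lo, a_hi + b_hi]
    return (a[0] + b[0], a[1] + b[1])

def iv_sub(a, b):
    # [a_lo - b_hi, a_hi - b_lo]
    return (a[0] - b[1], a[1] - b[0])

def iv_mul(a, b):
    # min/max over the four cross products is always correct, including the
    # special case where both operands contain zero; the hardware narrows
    # this with sign checks in the other cases.
    p = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(p), max(p))

def iv_union_hull(a, b):
    # hull of the union; a truly disjoint union needs two result intervals
    # and is one of the special cases handled separately in hardware
    return (min(a[0], b[0]), max(a[1], b[1]))

def iv_intersect(a, b):
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None   # None marks an empty result

def iv_shift_div(a, k):
    # division by 2**k via arithmetic shifts, rounded outward:
    # lower bound shifted toward -inf, upper bound toward +inf
    return (a[0] >> k, -((-a[1]) >> k))
```

For example, `iv_mul((-2, 3), (-1, 4))` returns `(-8, 12)`, the zero-straddling case that requires all four products in hardware.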
With the basic architecture in place, modifications were made to improve the
performance of the design. Since throughput is a critical performance metric,
pipelining is employed to increase efficiency. The multipliers in the basic
design are replaced by pipelined multipliers with different numbers of stages;
after a trade-off between throughput and area, a design with a three-stage
pipelined multiplier is found to be the best. To enhance system performance
further, the rest of the design has also been pipelined. A highly pipelined
design for an interval arithmetic logic unit shows a significant improvement
in throughput over the non-pipelined design. This improvement is achieved by
reducing the critical path of the original design, and the synthesis results
confirm the increase in throughput. Table 6.1 compares the clock period,
frequency of operation, throughput, area and initial latency of the
non-pipelined and highly pipelined designs. An estimate of power dissipation
was also obtained for all of the architectures. From the timing and power
dissipation values, a final performance metric of throughput per unit power
was evaluated to determine the best architecture; the table reflects this
metric as well. From the data in the table, the highly pipelined design is by
far the best architecture among all those discussed.
Performance Metric    Non-pipelined              Highly-pipelined
Clock Period          17.73 ns                   3.25 ns
Frequency             56 MHz                     307 MHz
Throughput            56 × 10⁶ op/cyc            307 × 10⁶ op/cyc
Area                  253476 µm²                 548899 µm²
Initial Latency       2                          7
Throughput/Power      20.231 × 10⁸ op/cyc/mW     71.196 × 10⁸ op/cyc/mW
Table 6.1: Comparison of Non-Pipelined and Highly-Pipelined Designs
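As a back-of-the-envelope check on Table 6.1, the clock-period speedup and the power dissipation implied by the throughput-per-unit-power metric can be recomputed from the tabulated values (in the table's own units; the implied power figures are our derivation, not values stated in the table):

```python
# Recomputing derived quantities from Table 6.1, in the table's own units.
period_np, period_hp = 17.73e-9, 3.25e-9   # clock periods, seconds
thr_np, thr_hp = 56e6, 307e6               # throughput, op/cyc
tpp_np, tpp_hp = 20.231e8, 71.196e8        # throughput per unit power, op/cyc/mW

speedup = period_np / period_hp            # ~5.45x shorter clock period
power_np = thr_np / tpp_np                 # implied power, ~0.028 mW
power_hp = thr_hp / tpp_hp                 # implied power, ~0.043 mW
```

The implied power values of roughly 0.028 mW and 0.043 mW appear consistent with the 0–0.05 mW scale of the power dissipation plot in Figure 5.12.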
6.2 Future Work
The work done in this thesis can be carried forward into several DSP
applications. One possibility is adaptive filtering: the ALU could be employed
to perform the intensive calculations needed in an FIR or IIR filter. The
numerical reliability of such interval-based adaptive systems is expected to
be higher than that of conventional non-interval systems, and comparing the
hardware specifications of non-interval and interval adaptive filters would
quantify the extra effort required to achieve this higher reliability. In
signal processing there are also many instances where the objective function
to be minimized is not convex (e.g., adaptive IIR filtering), so convergence
to the global optimum cannot be guaranteed. Interval analysis has the
potential to handle such non-convex objective functions, and a natural
application of the ALU would be in these global optimization algorithms.
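To illustrate the global optimization idea, here is a minimal interval branch-and-bound sketch in the spirit of Moore and Hansen, not an algorithm from this thesis. Interval evaluation gives a guaranteed lower bound of the objective over each box; boxes whose lower bound exceeds the best known value cannot contain the global minimum and are discarded, and the survivors are bisected. The test function f(t) = (t² − 2)², with global minima at ±√2, is purely illustrative.

```python
# Minimal interval branch-and-bound sketch (Moore/Hansen style), illustrative
# only. f(t) = (t*t - 2)**2 has global minima at t = +/- sqrt(2).

def isqr(a):
    # interval extension of t -> t*t (tight even when the interval spans 0)
    lo, hi = a
    if lo >= 0:
        return (lo * lo, hi * hi)
    if hi <= 0:
        return (hi * hi, lo * lo)
    return (0.0, max(lo * lo, hi * hi))

def F(x):
    # interval extension of f: a guaranteed enclosure of f over the box x
    s = isqr(x)
    return isqr((s[0] - 2.0, s[1] - 2.0))

def branch_and_bound(lo, hi, tol=1e-6):
    best = float('inf')          # best known upper bound on the minimum
    work, boxes = [(lo, hi)], []
    while work:
        a, b = work.pop()
        if F((a, b))[0] > best:  # box provably cannot contain the minimum
            continue
        m = 0.5 * (a + b)
        best = min(best, (m * m - 2.0) ** 2)  # point value tightens the bound
        if b - a < tol:
            boxes.append((a, b))              # small surviving box
        else:
            work += [(a, m), (m, b)]
    return best, boxes

best, boxes = branch_and_bound(-3.0, 3.0)
# best approaches 0 and the surviving boxes cluster around +/- sqrt(2)
```

The pruning test `F((a, b))[0] > best` is where interval arithmetic earns its keep: it certifies that a whole region of the search space can be discarded, something point arithmetic cannot do for non-convex objectives.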
As part of future work, the vision is to build hardware to evaluate the
trigonometric, logarithmic and exponential functions. This would require
highly complex designs, and increased complexity leads to increased power
dissipation. To limit power dissipation, separate modules could be built to
evaluate these functions based on the frequency of their use, and the overall
design could be made modular: such complex modules could be connected when
required and disconnected at other times to minimize power dissipation.
One of the long-term goals is to build a co-processor based on interval
arithmetic, with this ALU at its heart; a suitable bus architecture and I/O
peripherals would be required. The size of the accumulator could be increased,
based on the application, to allow repeated multiply-accumulate execution and
to handle overflow. The fixed-point number format could also be chosen per
application: Section 3.1.1 describes the representation of 16-bit fixed-point
numbers in the 8:8 form, but the 4:12, 2:14 or 1:15 forms may be chosen
depending on the application for which the ALU is used. Finally, an aim is to
port a major part of this design onto an FPGA.
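The trade-off among the 8:8, 4:12, 2:14 and 1:15 forms is between range and resolution. A hypothetical pair of helpers (names ours, not the thesis's) makes the interpretation of the same 16-bit word under different integer:fraction splits concrete:

```python
# Hypothetical helpers (names ours) interpreting a 16-bit word under an
# m:n integer:fraction split, as in Section 3.1.1's 8:8 example.

def to_fixed(x, frac_bits, word=16):
    # quantize x with step 2**-frac_bits; store as a two's-complement pattern
    v = int(round(x * (1 << frac_bits)))
    if not -(1 << (word - 1)) <= v <= (1 << (word - 1)) - 1:
        raise OverflowError("value does not fit the chosen format")
    return v & ((1 << word) - 1)

def from_fixed(bits, frac_bits, word=16):
    # undo the two's-complement encoding and rescale
    v = bits - (1 << word) if bits & (1 << (word - 1)) else bits
    return v / (1 << frac_bits)

# 8:8 covers roughly [-128, 128) with step 2**-8, while 1:15 covers only
# [-1, 1) but with step 2**-15: range is traded for resolution.
```

Which split suits a given application depends on the expected signal range and required precision, which is why the format is best left as an application-level choice.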