CARNEGIE MELLON UNIVERSITY
Analysis and Design of Low Power Digital Multipliers
A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
for the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRICAL AND COMPUTER ENGINEERING
by
Pascal Constantin Hans Meier
Pittsburgh, Pennsylvania
August, 1999
Acknowledgements
This work represents the culmination of several years of hard work exploring and developing ideas in the field of analysis and design of low power digital multipliers. Although this thesis prominently bears my name, it really is the result of the collaborative efforts of a number of people who have discussed, encouraged, cajoled, and otherwise supported me, through successes and failures, to its ultimate publication. It is to these people that this work is dedicated, and I would like to acknowledge their contributions.
My advisors at Carnegie Mellon University, Professors Rob Rutenbar and L. Richard Carley, were my primary guides in this journey. Professor Rutenbar helped me develop a clear understanding of CAD theory and practice through his concise expositions in courses and one-on-one discussion. Professor Carley’s expertise in circuit design greatly broadened my design experience, particularly in the ‘art’ of analog circuit design, a knowledge of which becomes most useful to the digital designer who tries to ‘push the envelope’. The third member of my thesis committee, Professor Larry Pileggi, was especially helpful in questioning my ideas, in particular with respect to technology migration issues.
I was fortunate to have two excellent advisors from industry, Dr. Chris Nicol and Dr. John Fishburn, both from Lucent Technologies. Dr. Nicol’s experience in multiplier power reduction greatly aided in determining promising areas of investigation, and his knowledge of prior work helped me plan the course of this research. Dr. Fishburn’s extensive background in system-level CAD tools provided for a number of interesting and lively discussions on power and delay tendencies of various multiplier designs. For their help I am most grateful.
My fellow graduate students were a constant source of support, and I was greatly enriched by their presence in technical discussions, social events and daily activities: Mehmet Aktuna, Bulent Basaran, Sari Coumeri, Prakash Gopalakrishnan, David Guillou, Rony Kay, Michael Krasnicki, Ram Krishnamurthy, Mark Mescher, Joshua Park, Michele Quarantelli, Nick Zayed. I write this list, knowing that I have forgotten someone—you know who you are. The staff of the Electrical and Computer Engineering Department were equally supportive, especially when the challenges came. I owe a debt of gratitude to my surrogate ‘mom’, Lynn Philbin and to my good friend, Lyz Knight. We made it after all.
The struggle of graduate school and numerous other undertakings would not have reached fruition without the support of many friends in Pittsburgh. In particular, the Catholic community in the Oakland and surrounding areas was a source of constant support, and I am particularly indebted to the Oratorian Fathers of the Ryan Catholic Newman Center/Pittsburgh Oratory for guidance during this time.
Finally I wish to specially thank my parents for all their encouragement, and Peggy Chan, who patiently supported me through this work.
This work was funded by DARPA under contract ADAAL 01-95-K-3527 and the NSF under contracts 942037 and 9408457.
Table of Contents
1 Introduction 1
1.1 Power Reduction in Integrated Circuits 3
1.2 Integrated Circuit Multiplication 6
1.3 Delay and Power in Multipliers 7
1.4 Power vs. Energy 11
1.5 Research Approach 12
1.6 Thesis Outline 15
2 Background: Multipliers and Power Dissipation 19
2.1 Multipliers 19
2.1.1 Multiplier Structure 20
2.1.2 Partial Product Generation 21
2.1.3 Partial Product Reduction 23
2.1.4 Array Style Reduction 26
2.1.5 Wallace Tree Partial Product Reduction 28
2.1.6 Partial Product Reduction/Generation Using Booth Recoding 30
2.1.7 Final Adder 33
2.1.8 Summary 38
2.2 CMOS Power 38
2.2.1 Power Optimization Fundamentals 42
2.3 Multiplier Power Reduction 49
2.3.1 Logic Level Multiplier Optimization 51
2.3.2 Architectural Power Optimization 54
2.3.3 Low Power Multiplication Research 55
2.4 Summary 57
3 Power Trade-offs in Array and Wallace Tree Multipliers 59
3.1 Introduction 59
3.1.1 Partial Product Reduction Schemes 60
3.1.2 Analysis of Switching Behavior 61
3.1.3 Initial Investigations 65
3.2 An Improved Analysis Methodology 66
3.2.1 Generation of Component Library 67
3.2.2 Circuitry 68
3.2.3 Cell Characterization 69
3.2.4 Partial Product Reduction Generators 71
3.2.5 Adder Generators 72
3.2.6 Layout Model and Wallace Tree Placement 72
3.2.7 Power and Delay Estimation 74
3.3 Experimental Results 75
3.3.1 Layout Characteristics 76
3.3.2 Energy Per Operation 76
3.3.3 Delay 77
3.3.4 Further Modelling Refinements 80
3.4 Summary 81
4 Minimizing Switching Activity By Latch Insertion 83
4.1 Introduction 83
4.2 False Switching in Multipliers 84
4.2.1 Input Latching 85
4.2.2 Latching the Signal Path 87
4.2.3 Previous Work 89
4.2.4 General Principles of TRB Insertion 91
4.3 Latches as Transition Retaining Barriers in Wallace Trees 95
4.3.1 Placement of Latches 95
4.3.2 Wallace Tree Latch Placement 97
4.3.3 Latch Insertion Methodology 98
4.3.4 Placement of Latches on the Wallace Tree/Final Adder Boundary 101
4.3.5 Placing Latches in Wallace Trees 104
4.4 Experiments 108
4.4.1 Experiment: Latch Placement on the Wallace Tree/Final Adder Boundary 109
4.4.2 Experiment: Placing Latches in the Wallace Tree 111
4.4.3 Conclusions 112
4.5 Conclusions 114
5 Minimizing Power Via Inverse Polarity Optimization 117
5.1 Introduction 117
5.2 Polarity Inversion 118
5.3 Design Issues for Polarity Inversion 121
5.3.1 Adder Circuit Designs 122
5.3.2 Partial Product Reduction 125
5.3.3 Inverse Polarity Wallace Tree Algorithm 127
5.3.4 Physical Design Issues 130
5.4 Experimental Results using Layout Estimation 133
5.5 Experiments with Detailed Layout 137
5.5.1 Enhanced Methodology 137
5.5.2 Placement Details 141
5.5.3 Interconnect Details 143
5.5.4 Aspect Ratio Details 148
5.6 Additional Design Considerations 152
5.6.1 Wiring Delay 153
5.6.2 ‘Logic-based’ Delay 154
5.6.3 Noise Issues 156
5.6.4 Further Optimizations 157
5.7 Summary 159
6 Conclusions 161
7 Bibliography 165
List of Figures
Figure 1. Digital multiplication flow. 21
Figure 2. Partial product generation for 6-bit by 6-bit multiplication. 22
Figure 3. Partial product addition. (a) Full adder cell. (b) Basic ripple adder. (c) We can use ripple addition to add all the shifted copies of the multiplicand. (d) Since there are n-1 ripple adders, each of width n, this basic method takes O(n²). 25
Figure 4. Array partial product reduction- (a) the initial partial product (b) using a row of carry-save full adders to reduce 3 bit vectors down to two (c) resulting PPA (d) full array structure. 27
Figure 5. Wallace tree partial product reduction. (a) partial product array (b) parallel carry-save addition (c) resulting PPA (d) complete Wallace tree structure. 29
Figure 6. Booth recoding (radix 4). 32
Figure 7. Ripple adder 34
Figure 8. Carry Skip Adders: (a) One carry-skip block (b) 12-bit adder. 35
Figure 9. Carry Select Adders: (a) Basic structure of adder, (b) delay chart, and (c) construction for minimum delay. 36
Figure 10. Equation for CMOS power consumption 40
Figure 11. Short circuit current occurs when CMOS devices switch. If the input of the gate is in the region where PMOS and NMOS devices are both on, current will flow directly between the rails. Short circuit current can occur if the gate is not held close to the Vdd or Vgnd rails. 41
Figure 12. Biasing for Vt modification via backgate effect 43
Figure 13. Pulldown stack of 3-input NAND 46
Figure 14. Switching behavior for different input arrival times 47
Figure 15. Output glitching. 48
Figure 16. Sequential vs. parallel layouts for logic blocks. 62
Figure 17. Signal flow in arrays vs. Wallace trees. (a) In arrays, inputs are present at every level of logic depth, so digital circuits at deeper logic levels experience more switching. However, the carry-save adders are arranged in rows, so signals tend to flow in "waves" down the logic. (b) Wallace trees have inputs at one logic level, so input data arrives in parallel and flows downward. However, some connections "skip" a logic level and so input arrival times tend to be skewed at deeper logic levels. 64
Figure 18. Transition count comparison for multipliers. 65
Figure 19. Full adder, also called carry-save adder, implemented using the ’28T’ construction. 68
Figure 20. Multiplier layout model. 73
Figure 21. Estimated average energy per multiply op. 77
Figure 22. Estimated worst-case multiplier delay (ns). 79
Figure 23. Forms of glitch-inducing delay 85
Figure 24. Using latches to re-time the generation of partial products (Booth). 86
Figure 25. Using latches to equalize signal arrival times in the signal path (Transition Retaining Barriers.) 88
Figure 26. The false switching in an array is linear in terms of the logic depth. a) For a 16 bit multiplier, the false switching at logic depth 16 is approximately 8 toggles/operation. b) Inserting a TRB saves some of the switching (gray box.) c) The TRB should not be inserted too early or d) too late, as this lessens the amount of false switching that is eliminated. 90
Figure 27. Incorporating latching behavior into an inverter. (a) Inverter. (b) C2MOS version of inverter. (c) Incorporating state preservation using a pass gate. 92
Figure 28. Triggering of latches. A chain of inverters may be used to generate the delay signal. If all latches are driven in parallel a) the final signals should be buffered. Otherwise, b) the delay chain can be used unbuffered, assuming the load is not so great. 94
Figure 29. Width of signal path: (a) In arrays, the width of the signal path is constant at all logic depths, but in Wallace trees (b), the width of the signal path is greater at shallower logic depths, but width decreases as the logic depth increases. 96
Figure 30. Procedure for latch insertion: 1) create a chain of inverters 2) determine placement of latches 3) set timing of latches 4) trim inverter chain. 98
Figure 31. Potential latch insertion sites: a) at the Wallace tree/final adderboundary and b) in the Wallace tree. 100
Figure 32. "Cascade" triggering style - in this method, the triggering signals of latches on the inputs of an adder are incrementally delayed, so the inputs are delayed to compensate for the delay of the carry. 103
Figure 33. Placing latches by logic depth. a), b) calculating logic depth, c) placing latches "one up from the bottom." 105
Figure 34. Placement of latches in Wallace tree. a) Original structure, and b) placement of latches, "one level" up from the PP reduction/adder interface. 106
Figure 35. Placement of latches on Wallace tree/Final adder boundary 109
Figure 36. Placement of latches within Wallace tree. 111
Figure 37. Inverse polarity optimization (a) Conventional ripple adder. (b) Inverted polarity version (c) Multiplier PPA structure (array). 120
Figure 38. The 28T full adder (CSA) implementation. 123
Figure 39. Conventional implementation vs. inverse polarity equivalence. (a) For full adders, CSA(a', b', c') = (CSA(a, b, c))'. (b) But for half adders, HA(a', b') != (HA(a, b))'. 124
Figure 40. HAIP designs (a) POS inputs, NEG outputs, (b) NEG inputs, POS outputs 125
Figure 41. Fadavi-Ardekani algorithm (a) Trapezoidal PP array — all bits in column n to be added, plus carry-in bits from column n-1. (b) Bits are put in a priority queue (one queue for each column). FA's are applied to earliest arriving bits, yielding (c) two bit vectors to be added by the final adder. 126
Figure 42. Inverse polarity - (a) arrays have regular alternating structure. (b) Wallace trees have connections which “skip” a level, causing polarity conflicts at subsequent adder inputs. 127
Figure 43. Basic inverted polarity CSA tiling algorithm. 129
Figure 44. Procedural assembly of final adder (a) Calculate width of final adder, (b) based on width of placement area, (c) determine number of rows needed, and trowel in the final adder. 132
Figure 45. Average short circuit currents (a) full adders have lower short circuit currents than (b) inverters. 135
Figure 46. Physical design verification methodology. 139
Figure 47. Placement for (a) 4-bit conventional multiplier and (b) 8-bit inverse polarity multiplier. The final adder is shown in dotted lines. 142
Figure 48. Routing of 8-bit multipliers - (top) Conventional 8-bit multiplier and (bottom) inverse polarity 8-bit multiplier. 144
Figure 49. Estimated versus extracted net capacitances for an 8-bit inverse polarity multiplier 146
Figure 50. Physical design of Wallace tree multiplier (a) logical structure, (b) idealized physical layout. 149
Figure 51. Routing of 8-bit multipliers, aspect ratio ~1 - (a) Conventional 8-bit multiplier and (b) inverse polarity 8-bit multiplier. 151
Figure 52. Inverse polarity loading effects (a) Use of full adder in PP reduction. (b) Circuit implementation of "(a)" using conventional full adder. (c) Implementation using inverse polarity full adder—note that load on carry output affects delay of sum output. 155
List of Tables
Table 1: Booth encoding 31
Table 2: Asymptotic time and space characteristics 38
Table 3: Array multipliers - estimated area (mm²) 76
Table 4: Wallace tree multipliers - estimated area (mm²) 76
Table 5: 8 bit multiplier - estimated delay (ns) 78
Table 6: Count of inverters in multipliers 134
Table 7: Energy / operation for 8 bit multipliers 134
Table 8: Energy / operation for 16 bit multipliers 134
Table 9: Delay, area, wire cap. for 16 bit multipliers 136
Table 10: Simulated and extracted data 145
Table 11: Aspect ratio and capacitance 152
1 Introduction
With the increasing level of device integration and the growth in complexity of micro-
electronic circuits, reduction of power dissipation has come to the fore as a primary design
goal. While power efficiency has always been desirable in electronic circuits, only recently
has it become a limiting factor for a broad range of applications, requiring consideration
early on in the design process.
Power dissipation limitations come in two flavors. The first is related to cooling consid-
erations when implementing high performance systems. High speed circuits dissipate large
amounts of energy in a short amount of time, generating a great deal of heat as a by-product.
This heat needs to be removed by the package on which integrated circuits are mounted.
Heat removal may become a limiting factor if the package (PC board, system enclosure,
heat sink) cannot adequately dissipate this heat, or if the required thermal components are too
expensive for the application.
The second failure mode of high-power circuits relates to the increasing popularity of por-
table electronic devices. Laptop computers, pagers, portable video players and cellular phones
all use batteries as a power source, which by their nature provide a limited time of operation
before they require recharging. To extend battery life, low power operation is desirable in inte-
grated circuits. Furthermore, successive generations of applications often require more com-
puting power, placing greater demands on energy storage elements within the system.
Technology improvements in the last few decades have succeeded in reducing power con-
sumption. Trends such as using CMOS instead of bipolar devices, and reduction in feature
size of lithographic processes have served to reduce power dissipation, although other objec-
tives, namely high integration and speed, were the primary goals of such improvements. It is
only in recent times that power optimization has been viewed as a design goal in its own right.
This work will look at power optimization applied to a circuit element which is pervasive
in modern microelectronic circuit design, the digital multiplier. In this chapter, we will
describe power optimization in general terms, especially how it fits in to current design flows.
We will briefly describe the multiplier and how it is used for different applications, along with
some design goals for building suitable multipliers. Finally, we give an overview of the issues
investigated in this work, and describe some of the results.
1.1 Power Reduction in Integrated Circuits
Although the above discussion has motivated the ascendancy of power to the attention
of the designer, it is important to understand the place that power has relative to other objec-
tives. After guaranteeing correct digital functionality, the primary consideration for system
designers has always been, and continues to be, speed. A circuit is specified to operate at a
particular delay; otherwise, the entire system does not work. Further reduction in the delay is
beneficial but not strictly necessary. The power dissipation characteristics of a system, on
the other hand, are often a consequence of the delay specification. Once the delay of the sys-
tem is achieved, package cooling and/or battery resources will be allocated appropriately
(within reason). Other factors may have equal or greater importance than power dissipation:
area of implementation and yield/reliability issues are subjects which the designer must take
into account. Nevertheless, the increase in power dissipation in successive generations of
integrated circuit technologies has progressed at such a pace that it is now one of the pri-
mary considerations for designers.
A major complication in microelectronic circuits is the fact that many design decisions
involve a power-delay trade-off; one cannot be lowered without raising the other. It is
important to note, however, that power reduction techniques are not necessarily negatively correlated to delay reduction. For example, one method to reduce delay in a circuit’s critical path is to upsize the driving strength of gates, which also results in increased power dissipation. However, another way to lower delay might be to reduce interconnect capacitance, which reduces both power and delay. Generally, great power savings can be achieved if delay is not an issue, but optimizing power without considering delay is trivial.

Power reduction techniques are applied at all levels of the design hierarchy. At the most basic level, the parameters of the lithographic process in which the integrated circuit is manufactured may be modified. Doping concentrations, minimum geometrical spacings of structures, etc., all affect power dissipation. Many of these parameters are beyond the control of the circuit designer. Some of these ‘global’ constraints, however, may be modified at various stages of the design methodology, although they are by and large left alone. For example, lowering of Vdd is a well-accepted method of reducing chip power, and there has been a constant trend towards running even the most high-performance microprocessors at lower supply voltages, despite the delay penalty that low supply voltage incurs. (Lowering Vdd goes hand-in-hand with other scaling schemes, such as gate oxide reduction, line-width scaling, etc. Voltage reduction may be seen as a consequence of scaling, and the attendant power reduction is a serendipitous result.) Other schemes, such as using multiple supply voltages (high voltage rails for
high speed sections, low voltages otherwise) have been proposed, although their use to date has not been large.

Power reduction techniques applied to the design hierarchy for digital CMOS devices can be subdivided as follows:

• Chip/floorplan level - At this point, the power characteristics for the entire die are planned and power/delay ‘budgets’ are allocated. Vdd and Vth are determined based on performance goals. Power due to the assembly of macroblocks (interconnect and clock nets) is optimized, subject to delay and area/routability constraints. As macroblocks are instantiated, this high level information is updated and budgets may be adjusted to increase/decrease specifications on other sub-blocks.

• Macroblock level - This stage comprises the assembly of gates into a basic function, such as a block of control logic, or an arithmetic unit. In a standard cell design methodology, it is at this point that the implementer’s intellectual property enters the design flow. As such, the various methods of implementing a particular function impact both power and delay. The power optimization techniques available at this level are numerous.

• Gate/circuit level - This is the lowest level stage visible to the designer, where transistors are assembled into basic gates. In a standard cell methodology, the fundamental gates
used in macroblock assembly are defined here. Power optimizations are very restricted at
this level, as delay specification often dictates the power at which a device will operate.
Emphasis is placed on reducing device parasitics and area while maximizing routability. A
full custom approach may succeed in achieving lower power than a standard cell
approach, as individual transistor sizing may overcome certain obvious inefficiencies.
More aggressive circuit families, such as dynamic logic, dual rail logic, or low-swing
logic can be implemented at this stage, but these often require complicated design flows
(e.g., through the introduction of clocks, noise sensitive nets, the need for level conver-
sion, etc.)
In our work, we focus on a standard cell CMOS methodology, as this is the most common
method for quickly and efficiently assembling a digital integrated circuit. As such, a number
of clever and useful circuit techniques cannot be applied due to the complications which their
use introduces in the design flow. Therefore, the performance of designs achievable by a
standard cell based implementation is sub-optimal, but the ease of implementation of such a
flow allows exploration of a wide range of designs.
1.2 Integrated Circuit Multiplication
Digital multiplication is one of the most basic functions in a wide range of algorithms[4].
The ubiquity of this operation in computing has given rise to a large number of multiplier
implementations, each with different specifications and goals. Some applications require
wide dynamic range, others need high precision, while in some cases, neither of these char-
acteristics are very tightly specified. Digital multiplication is used as opposed to analog
when high precision is an issue; it is fairly straightforward to make digital multipliers as
accurate as the application requires. Precision required for multiplication varies by function.
At the low end, 8 bits are needed, e.g., in image compression algorithms, or 16 bits in more
precise DSP tasks. At the high end, we see 53 bit and 64 bit multiplication (IEEE double
precision standard[26].) Typically, we see 16 bit multipliers used for digital signal process-
ing and 53/64 bit multipliers used in microprocessors.
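The integer multiplication these designs implement reduces to summing shifted copies of the multiplicand, one per multiplier bit. The following Python sketch is an illustrative model of that arithmetic only, not of any hardware structure in this thesis; the function name and the bit-width check are ours:

```python
def multiply(a, b, n):
    """Multiply two n-bit unsigned integers by summing shifted partial products."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n)
    result = 0
    for i in range(n):
        if (b >> i) & 1:          # bit i of the multiplier selects a partial product
            result += a << i      # the multiplicand, shifted left by i positions
    return result
```

The hardware structures surveyed in Chapter 2 (arrays, Wallace trees, Booth recoding) are different ways of generating and summing exactly these shifted partial products.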
The basic operation in these designs is integer multiplication; in floating point multipli-
ers, integer multiplication units are sub-blocks of the greater floating point unit. Signed ver-
sus unsigned techniques have an impact on the design, and some clever techniques have
been suggested for manipulating the bit representation of numbers to generate power sav-
ings. However, the primary consideration in multipliers has been and continues to be delay.
1.3 Delay and Power in Multipliers
This thesis focuses on multiplication performed in digital signal processing (DSP) algorithms. Multiplications in this regime typically require a precision of 8 or 16 bits. From a delay perspective, algorithms place two constraints on multiplication: latency and throughput. Latency is the real delay of computing a function, a measure of how long after the inputs to a device are stable the final result is available on the outputs. Throughput is a measure of how many multiplications can be performed in a given amount of time. For a simple combinational multiplier, throughput is a function of latency. However, various techniques exist which can compute several multiplications in parallel, e.g., through pipelining; in these cases, latency is only loosely correlated to throughput.

For digital signal processing, throughput is a major concern. DSP algorithms are often related to perception of audio/visual stimuli, for example, image or voice transmission and recognition. In these tasks, precision requirements are less stringent than for other applications (e.g., numerical algorithms for scientific computing) so small bit-width multipliers may be used—latency is a function of bit width, and small multipliers do not create long delay paths. For many DSP applications, the relevant limiting specification is throughput. These tasks often require fairly coarse resolution of images but operate on a large amount of data representing different image or sound samples. For example, image rendering requires performing computations on a large number of polygons, whereas the precision involved (bits required for identifying a particular color, bits required for identifying spatial coordinates) is
fairly small. Voice compression requires a large number of 8 bit calculations. One way of processing this large number of computations quickly can be achieved by lowering the latency of the multiplication; in this manner, the multiplier can start performing the next operation sooner.

A more efficient method to increase the number of computations is to increase the throughput. Various schemes are possible; for example, pipelining/interleaving of data allows one functional unit to compute several operations concurrently, while implementing multiple devices on one chip simply increases the throughput by the number of additional units. These techniques tend to be more efficient than latency reduction, because if one tries to lower the delay of a circuit, diminishing returns are quickly encountered (if a circuit’s transistors are upsized, at a certain point the delay does not decrease further.) When optimizing throughput, on the other hand, for each additional functional block that is added, the number of operations which may be computed in a given amount of time increases by one.

Although pipelining is good for throughput, it may be hard to implement in tightly coupled hardware/software systems. While the logic implementation of pipelining is fairly straightforward, getting compilers to build programs to take advantage of pipelining is often difficult. The problem lies in setting up a series of operations to begin execution while other operations have yet to finish. These ‘parallel’ or ‘multiple-issue’ modes of system behavior
have timing dependencies which complicate the task of writing a compiler that can take
advantage of such hardware techniques. Moreover, while some DSP algorithms lend them-
selves to parallel operation, others require processing to be more sequential in order, rendering
the additional pipeline hardware useless. Finally, extremely high speed code is often imple-
mented by hand in assembly language. Understanding all the methods of optimizing a pipe-
lined function can be very tedious if done manually. These reasons argue for using multiple
multiplication functional units on one chip, as opposed to implementing a heavily pipelined
multiplier[35].
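The latency/throughput distinction above can be summarized in a toy model (all numbers and function names here are hypothetical, chosen only for illustration):

```python
def throughput_single(latency_ns):
    # A combinational multiplier completes one result per latency period.
    return 1.0 / latency_ns            # operations per ns

def throughput_parallel(latency_ns, units):
    # k identical units working concurrently multiply throughput by k.
    return units / latency_ns

def throughput_pipelined(stage_delay_ns):
    # Once full, a pipeline delivers one result per stage delay,
    # even though each individual result still takes the full latency.
    return 1.0 / stage_delay_ns
```

For example, doubling the number of units doubles throughput directly, whereas halving latency by transistor upsizing runs into the diminishing returns described above.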
The multiplier is a fairly large block of a computing system. The amount of circuitry
involved is proportional to the square of its resolution (i.e., a multiplier of size n bits has
O(n²) gates.) Not only is the multiplier a high delay block, but it can be a significant source of
power dissipation. Based on the argument delineated above, that several multipliers should be
present on-chip as more DSP compute power is needed, the power dissipation involved in
multiplication will become more dominant. (Even if the pipelined approach is used, to a first
order, the pipelined multiplier will dissipate as much power as several multiplier blocks.
Although a pipelined version has fewer gates, it will still experience roughly the same amount
of switching.) Therefore, digital multipliers have become one of the prime circuits targeted for
power reduction [39][41][42].
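The quadratic growth claimed above can be sketched with a simple counting model; the cell counts are rough approximations for a plain array multiplier, assumed here only for illustration:

```python
def partial_product_bits(n):
    # One AND gate per (multiplicand bit, multiplier bit) pair.
    return n * n

def array_csa_cells(n):
    # Rough count for a simple array: (n - 1) rows of n carry-save adders.
    return n * (n - 1)

for n in (8, 16, 32):
    print(n, partial_product_bits(n), array_csa_cells(n))
```

Doubling the operand width thus quadruples the partial-product hardware, which is why multiplier power grows so quickly with precision.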
1.4 Power vs. Energy

The distinction between the terms power and energy is important to this discussion. Note that energy is a measure of the total number of Joules dissipated by a circuit, whereas power refers to the number of Joules dissipated over a certain amount of time. Properly speaking, power reduction is a different goal than energy reduction.

Power can be a problem primarily when heat removal is a concern. If too many Joules of energy are converted into heat within a short amount of time, a package’s heat sink may not be able to redistribute this heat quickly enough; the result will be a rise in temperature and subsequent thermal failure. If the same amount of energy is generated over a longer amount of time, power dissipation is reduced and the cooling structure can better deal with the thermal demands of the circuit. Here, power reduction consists of reducing the case when a large amount of energy is dissipated in a short amount of time. Again, the total energy dissipated may remain the same.

Energy reduction consists of reducing the total number of Joules to be dissipated. Therefore, we often speak of energy per operation as the metric to be optimized; time is not a factor in this calculation. Power reduction, then, belongs to the domain of thermal reliability, whereas energy reduction lies in the domain of maximizing battery lifetime.1
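A small worked example makes the distinction concrete; the 50 pJ/operation figure and the operation rates are assumed numbers, chosen only for illustration:

```python
# Energy vs. power: the same circuit, run at two different rates.
ENERGY_PER_OP_J = 50e-12   # Joules per multiply (hypothetical value)

def avg_power_w(ops_per_sec):
    # Power is energy dissipated per unit time.
    return ENERGY_PER_OP_J * ops_per_sec

fast = avg_power_w(200e6)   # 200 Mops/s -> 0.010 W
slow = avg_power_w(50e6)    #  50 Mops/s -> 0.0025 W
```

Running four times slower cuts the thermal load (power) by four, yet the energy drawn from a battery per operation, the metric governing battery lifetime, is unchanged.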
In digital CMOS, one often hears of a power-delay trade-off, or of a circuit operating at
Analysis and Design of Low Power Multipliers 11
or
g
ly,
tems
.
ll as
ed
net
h-
a point in the power*delay space. This power*delay is continually being improved, (e.g.,
using more advanced processes or better logic designs.) In a sense, this is a misnomer as
power*delay = (energy/delay)*delay = energy; this implies one should minimize energy—
more importantly, minimizing delay is irrelevant. Instead, one should speak of minimizin
energy*delay, as this metric involves two independent measures of circuit behavior.
The literature consistently refers to minimizing power, whereas the techniques described
in almost all cases involve minimizing energy. The two terms tend to be used interchangeab
with ‘power’ being used where ‘energy’ should be used instead. This usage most likely s
from this field of research being known as low-power circuit operation. We will retain the use
of ‘power’ because of its standard usage but will use ‘energy’ where clarity is warranted
1.5 Research Approach
In this work, we wish to identify the relevant components which impact power as well as delay of multipliers. As mentioned above, power as a figure of merit should be considered concurrently with delay; a power reduction with a proportional delay increase achieves no net advantage.2 In the first phase of our research, we focused on existing delay reduction tech-
1. Note that power reduction is also relevant to battery operation. Various types of battery differ in their ability to provide bursts of high current [22].
2. All things being equal. We will not consider the argument of Chandrakasan et al. [6], where voltage scaling and functional unit parallelism are exploited to achieve a power savings.
niques for multipliers and looked at their power dissipation properties. The field of delay
reduction for digital multipliers is a well developed one which spans at least three decades.
Delay targets can be met using different delay optimizations, and in some cases, multiple
methods may be used concurrently. It is of great interest to identify which techniques should
be applied to reduce delay if one also wishes to minimize power consumption. Those
approaches which show the most promise, as well as those which have not been previously
explored, are naturally the subject of our most intense focus. In the second phase of this
work, we attempted to expand the repertoire of possible techniques by investigating ideas
for power reduction which suggested themselves from the initial analysis. Not all the ideas
which we implemented were successful. Nevertheless, we did identify a few techniques
which show promise for power reduction in multipliers.
Our experimental procedure was based on the standard cell design methodology which
is common in industry today. We started from a small library of basic CMOS cells necessary
to implement our functions. These were extensively characterized for energy dissipation and
delay under a wide range of input slopes and output loads. The multiplier logic was derived
from standard descriptions existing in the literature, and the actual logic instantiation was
performed automatically using assembly algorithms derived from the literature, along with
some optimizations which we contributed to the process. Once the multiplier had been
assembled, the contribution of physical effects was estimated using a placement and routing stage, which attempted to determine rough wiring characteristics based on routing length. In later experiments, results were verified by performing a full layout using the Cadence tool suite.
Given the logic and physical description of the function, timing analysis was performed using static delay information from the characterized standard cells. Power was determined using a logic simulator which counted switching events and estimated glitches. Additional power numbers were derived from runs using HSPICE and Star-sim from Avant! Corporation [51][52].
Our design methodology evolved over time, as our tools were refined and accuracy was improved. Final values were checked using vendor tools where appropriate. Two goals motivated the design of this CAD methodology. First, we wanted to perform certain tasks which were not possible using vendor tools; some of these, such as multiplier assembly, were clearly necessary as this represents the contribution of this thesis. Second, while vendor tools performed some of the tasks necessary for this work (e.g., Synopsys Design Compiler could have been used for static timing analysis), we required several slight modifications to vendor tool functionality (e.g., timing analysis computing the static longest path) which were difficult to achieve. Although coding such procedures increased the time necessary to complete this
project, we also received the benefit of having access to these procedures at a very fine level of detail (for example, we were able to incorporate timing code in the inner loop of our multiplier assembly algorithm).
1.6 Thesis Outline
This thesis can be divided into three separate projects, which investigated different aspects of multiplier power dissipation. The overall thesis is organized as follows.
• Background: multipliers and power dissipation - (chapter 2): we first describe the domain of our investigation, power dissipation in multipliers. We present a description of digital multipliers and their component parts, along with optimizations that have been proposed to speed up the operation. Furthermore, we describe the state of the art in power reduction for multipliers. This is followed by an analysis of power dissipation in static CMOS digital logic, the standard form used to implement logic functionality. Some basic techniques for CMOS power reduction are briefly described.
• Arrays vs. Wallace trees applied to partial product reduction - (chapter 3): our initial work investigated two approaches to multiplier partial product reduction, focusing on the power dissipation characteristics of these two. Prior to our investigation, we had found contradictory data, suggesting that each of these two forms was more power effi-
cient than the other [37][38]. By considering switching activity and parasitic capacitance due to interconnect, we were able to demonstrate which of the two designs offers the most promise for low power operation.
• Transition retaining barriers - (chapter 4): we developed an idea for transparent latches to “re-synchronize” signals propagating through partial product reduction techniques. Our goal was to compare the potential for this technique in both arrays and Wallace trees. The analysis for arrays exists in the literature [45], so we attempted to apply this method to Wallace trees. Unfortunately, the characteristic behavior of signal propagation in Wallace trees makes this approach unsuitable, and power reductions were not obtained in our early analysis.
• Inverse-polarity techniques - (chapter 5): this approach investigated the potential of removing inverters in adder blocks within multipliers. The technique comes from ripple adders, which use this optimization to reduce logic depth as a delay optimization. We adapted this technique to attempt transistor count reduction, so as to reduce power dissipation. Our experiments show that power reductions of up to 25% can be achieved using this technique. Various characteristics of the logic function, such as delay, area, and noise properties are impacted when using this method and are further described. Furthermore, we attempted to verify layout estimates by performing full physical design on an
advanced process. Our analysis shows that our estimates were very close to data
obtained through complete design of the multipliers.
We conclude the thesis with some general observations about power reduction for multipliers. There are several unexplored prospects for power reduction which warrant further investigation, including a few suggested by the results described in this thesis; we discuss these in this last section.
2 Background: Multipliers and Power Dissipation
In this chapter, we discuss some of the basic concepts which are common to the areas
which we investigated. First, we present a brief description of digital multipliers, including
their structure and relevant components. Delay reduction techniques are also discussed.
Next, we review power dissipation in CMOS circuits, along with some basic techniques
which can be applied to reduce power.
2.1 Multipliers
In order to understand delay and power trade-offs in multipliers, we will describe the
basic circuit structure of digital multiplier implementations in more detail. We wish to
examine some of the techniques which have been developed to reduce multiplier delay, par-
ticularly to gain an understanding of their characteristic power dissipation.
While some insight can be gained through direct observation of logic structure, power dis-
sipation comes from several sources; techniques which reduce the power due to one of these
sources can worsen the power dissipation due to another. A brief discussion of sources of
power dissipation in digital CMOS will illustrate the relevant effects. To incorporate all the
relevant power dissipating effects into our analysis, we chose to evaluate multiplier designs by
developing a methodology which uses low-level circuit simulators to calculate power con-
sumption. To compensate for the slowness of a detailed circuit-based approach, a simulation-
based evaluation program allowed quick analysis of power and delay for these designs, and
accuracy was verified using the detailed circuit simulators.
2.1.1 Multiplier Structure
At the most basic level, digital multiplication can be seen as a series of bit shifts and bit additions, where two numbers, the multiplier and the multiplicand, are combined into the final
result. Consider the multiplication of two numbers: the multiplier P, and multiplicand C,
where P is an n-bit number with bit representation {pn-1, pn-2, ..., p0}, the most significant bit being pn-1 and the least significant bit being p0; C has a similar bit representation {cn-1, cn-2, ..., c0}. For unsigned multiplication, up to n shifted copies of the multiplicand are added to
form the result. The entire procedure is divided into three steps: partial product (PP) genera-
tion, partial product reduction, and final addition. This is illustrated conceptually in Fig. 1.
2.1.2 Partial Product Generation
The initial step in digital multiplication is to generate n shifted copies of the multiplicand which may then be added in the next stage. Whether a given shifted copy of the multiplicand is added depends on the value of the multiplier bit corresponding to this multiplicand copy. If the ith bit (i = 0 to n-1) of the multiplier is ‘1’, then the multiplicand is added. If the bit is ‘0’, the multiplicand is not added.
The bit representation of this function can be implemented using a logical AND gate,
Figure 1. Digital multiplication flow: the n-bit input operands A and B (n = 4 in the figure) undergo partial product array generation (n shifted binary numbers), partial product array reduction (down to 2 binary numbers), and final addition (the 2n-bit final product).
which performs AND(ci,pj), i = 0 to n-1, j = 0 to n-1. The resulting values are called partial
product bits or simply, partial products; if we line up these partial products by bit position
(bp = i + j), we arrive at a structure shown in Fig. 2. In this design, the partial product bits are
arranged in columns, which are to be added together to form the final value. The resulting
trapezoidal structure is called a partial product array or simply PPA. This step is called partial
product array generation.
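The AND-based generation step above can be sketched in a few lines; the function name and integer encoding below are ours, not the thesis's:

```python
# Sketch of unsigned partial product generation: row i is the multiplicand
# masked by multiplier bit p_i (the AND gates) and shifted left by i. The
# function name and encoding are ours, not the thesis's.

def partial_products(p, c, n):
    """Return n integer-encoded rows for n-bit multiplier p, multiplicand c."""
    rows = []
    for i in range(n):
        p_i = (p >> i) & 1                    # i-th multiplier bit
        rows.append((c if p_i else 0) << i)   # AND row, aligned to position i
    return rows

# Adding the rows column by column reproduces ordinary multiplication:
for p in range(16):
    for c in range(16):
        assert sum(partial_products(p, c, 4)) == p * c
```

The remaining stages of the multiplier exist only to perform this summation quickly.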
Various forms of partial product arrays exist, depending on the number representation. For
example, in the following section we describe a common variant called Booth recoding,
which allows a signed multi-bit partial product representation of the design. Common variants
Figure 2. Partial product generation for 6-bit by 6-bit multiplication: each partial product bit ppij is formed from multiplicand bits c0 through c5 and multiplier bits p0 through p5, and the bits line up into columns to be added.
for efficiently implementing two's complement are described in [27].
There are several important points to notice about the partial product array. First, in the most basic formulation (PPA bits generated via logical AND), all the bits are created in parallel; that is, the static delay of each of the bits is equal. Second, the dimensions of the array are functions of the size of the multiplier and multiplicand: the height of the array is proportional to the size of the multiplier, and the width is proportional to the size of the multiplicand. Finally, all the bits in a particular column are to be added together, and some columns have fewer bits than others. Specifically, low-order and high-order bit positions will require fewer additions than the middle bit positions; slightly more additions will be required at the high-order positions than at the low-order ones, as carries from low order positions result in a larger number of bits to be added at the high order bit positions.
2.1.3 Partial Product Reduction
The heart of an efficient digital multiplier implementation is in the manner in which the PPA bits are added. Were conventional carry adders used to implement these add operations, the delay of all the adders would consume a large amount of time, as each shifted version of the multiplicand would contribute a delay which is proportional to the width of the multiplicand. Instead, the partial product is reduced using a technique called carry-save addition [31]. This procedure allows successive additions to be incorporated into one global
addition step.
Consider a numerical bit vector representation of the following form: (bn-1, bn-2, ..., b1, b0), bi ∈ {0,1}, 0 ≤ i < n. If we wish to add two bits from two bit vectors, say a0 and b0, from bit
vectors a and b, we can use the full adder (Fig. 3a); it takes in three bits and produces a sum
bit, and a carry bit. When adding two vectors together, this block can be used to add two bits
at a given bit position with the carry-in from the previous bit position. Consider the case
where two bit vectors are to be added. At the lowest bit position, two bits are added, and the
carry is propagated to the next bit position. From then on the carry-in and the next two bits at
the higher bit position are combined, and a carry-out is generated. Using this rippling tech-
nique, we see that adding two n-bit numbers takes O(n) sequential bit additions, resulting in a
delay of O(n).
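The rippling scheme just described can be sketched as follows; the helper names and little-endian encoding are ours, not the thesis's:

```python
# Sketch of bit-vector ripple addition: one full adder per bit position,
# with the carry chained sequentially (hence O(n) delay). Names are ours.

def full_adder(a, b, cin):
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)   # (sum, carry)

def ripple_add(a_bits, b_bits):
    out, c = [], 0
    for a, b in zip(a_bits, b_bits):          # n sequential bit additions
        s, c = full_adder(a, b, c)
        out.append(s)
    return out, c                             # sum bits and final carry-out

a, b = [1, 1, 0, 1], [1, 0, 1, 1]             # 11 and 13, LSB first
s, cout = ripple_add(a, b)
assert sum(bit << i for i, bit in enumerate(s)) + (cout << 4) == 24
```

Each iteration depends on the carry from the previous one, which is exactly the sequential dependence that carry-save addition removes.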
If we have to add three bit vectors, A, B, and C, each of size n, we can use this method to
add first A and B, and then to add C to the result of A+B. The number of bit additions is O(2n).
We see that if we were to use this technique in the most simple-minded fashion to add n
shifted copies of an n-bit multiplicand, the delay will be O(n2). This occurs because we
assume that the add operations are dependent on previous add operations (the output of earlier
operations are inputs to later operations). See Fig. 3.
Although the final result comes about from combining all add operations, a certain
amount of independence exists between each operation. We can consider the add operation
on a column by column basis; all the bits in a column must be added together, along with the
carry-in bits of the previous column. The delay in calculating the output of a given column
is a function of when the carry-in bits (from the previous column) are available, as well as
the number of bits which are to be added (Fig. 3.) Carry-save addition leverages the fact that
the add operations in separate columns can be performed independently. If we are to add
three vectors of bits, we can use full adders to add the three bits in each column. The result
is a carry and a sum bit at each bit position, except at the lowest and highest bit position,
Figure 3. Partial product addition. (a) Full adder cell. (b) Basic ripple adder. (c) We can use ripple addition to add all the shifted copies of the multiplicand. (d) Since there are n-1 ripple adders, each of width n, this basic method takes O(n2).
which have one bit each. We see that three bit vectors have been “reduced” to two bit vectors, in the delay necessary for a full adder. Using this technique called carry-save addition, we can ‘reduce’ a set of vectors to be added together down to two bit vectors. Since the full adders are the basic addition element, full adders used in this context are often called carry-save adders or CSA's. Using carry-save addition is the prime reason why multiplication is much faster than would be predicted by the number of additions necessary.
2.1.4 Array Style Reduction
Given the trapezoidal array of partial product bits which must be added using carry-save addition, there exist several ways to implement a partial product reduction adder. In this section, we describe the most basic method, called array partial product reduction.
For example, in Fig. 4a, we see the trapezoidal PPA generated for a 6-bit by 6-bit multiplication. We can take the first three bit vectors, and add them using full adders, as in Fig. 4b. Combining the results of the addition with the remaining bits of the PPA, we get a result which appears in Fig. 4c. As we can see, three vectors have been reduced to two vectors.
Note that the outputs of this first set of full adders can now serve as inputs to another row of full adders, along with another bit vector. This structure repeats until the full array is instantiated as in Fig. 4d.
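The row-by-row carry-save reduction just described can be sketched as follows; the integer-encoded bit vectors and helper names are ours, not the thesis's:

```python
# Sketch of array-style carry-save reduction on integer-encoded bit vectors.
# Each step applies a full adder to every column at once, turning three
# addends into a sum vector and a carry vector of equal total value.

def carry_save(a, b, c):
    """3:2 reduction: sum = a^b^c, carry = majority, shifted up one column."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def reduce_array(vectors):
    """Fold each vector into the running (sum, carry) pair, one row each."""
    s, cy = vectors[0], 0
    for v in vectors[1:]:
        s, cy = carry_save(s, cy, v)
    return s, cy

rows = [3 << k for k in range(6)]             # e.g. shifted multiplicand copies
s, cy = reduce_array(rows)
assert s + cy == sum(rows)                    # only one carry-propagate add remains
```

The value is preserved at every row; a single conventional addition of the final sum and carry vectors produces the result.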
The notable characteristic of the array architecture is its regular structure. This has the advantage that it is very easy to lay out, as a single adder block and associated connec-
Figure 4. Array partial product reduction. (a) The initial partial product array. (b) Using a row of carry-save full adders to reduce 3 bit vectors down to two. (c) Resulting PPA. (d) Full array structure.
tions are replicated the width and depth of the array. The delay of this block is a function of the number of rows, O(n), which is a big improvement over the simple-minded scheme of using conventional adders for each row. However, it is possible to do better.
2.1.5 Wallace Tree Partial Product Reduction
In 1964, C.S. Wallace [23] observed that the later stages of the array structure must always wait for all the earlier stages to complete before their final values will be established. When performing a series of independent add operations, it is possible to create a structure which has less delay by performing the addition operations in parallel, where possible. For example, in the partial product array for 6-bit x 6-bit multiplication, two carry-save reductions can be done in parallel, resulting in a smaller PPA after just one step (Fig. 5a-c). This can be repeated, yielding the structure shown in Fig. 5d. (The final set of connections is somewhat complicated to draw here.)
Obviously, parallelizing the carry-save operations will yield a delay which is much shorter than the array's sequential series of operations. When using carry-save addition, we take three input bit vectors and reduce these to two output bit vectors. A sequential carry-save procedure will reduce the number of bit vectors by one at each stage, whereas the parallel method can
take sets of 3 vectors and reduce them to sets of 2 vectors. The number of stages, and hence
the delay of the sequential operation will be O(n), whereas the parallel method will be
Figure 5. Wallace tree partial product reduction. (a) partial product array (b) parallel carry-save addition (c) resulting PPA (d) complete Wallace tree structure.
O(log3/2 n).
Such parallel arrangements of CSA blocks are called Wallace trees and allow for a large reduction in the delay of the partial product reduction stage. The disadvantage of Wallace trees lies in their irregular layout (especially with respect to array structures), resulting in potentially greater wire loads. Furthermore, note that array structures result in a final add operation of width n, whereas the final adder in Wallace trees has a width of approximately 2n - log3/2 n.
2.1.6 Partial Product Reduction/Generation Using Booth Recoding
The technique of Booth recoding is based on the observation that under certain conditions, namely when a bit in the multiplier is ‘0’, a bank of carry-save adders does not perform a useful function; that is, a ‘0’ is added to the accumulated carry-save result, and the input bits are simply propagated to the output bits. In this case, these carry-save adders could be removed from the multiplier structure, resulting in delay and power savings. Unfortunately, it is not generally possible to know a priori what bits of the multiplier will be ‘0’. To maintain generality, we must provide for the case when all bits of the multiplier are ‘1’.
It may be possible to reduce circuitry, however, if one considers the largest delay case. In a 4 bit x 4 bit multiplication, circuitry must be provided for the case where the multiplier is
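The parallel 3-to-2 grouping that gives Wallace trees their logarithmic depth can be sketched as follows; the integer encoding and helper names are ours, not the thesis's:

```python
# Sketch of Wallace-style reduction: at each stage, group the pending bit
# vectors into sets of three and reduce each set from 3 to 2 in parallel,
# so the vector count shrinks geometrically rather than by one per stage.

def csa(a, b, c):
    """Carry-save adder: column-wise full adder over whole bit vectors."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_reduce(vectors):
    """Reduce a list of addends to two vectors; returns (vectors, stages)."""
    stages = 0
    while len(vectors) > 2:
        nxt, i = [], 0
        while i + 3 <= len(vectors):          # 3:2 reductions in parallel
            s, cy = csa(vectors[i], vectors[i + 1], vectors[i + 2])
            nxt += [s, cy]
            i += 3
        nxt += vectors[i:]                    # 0-2 leftover vectors pass through
        vectors, stages = nxt, stages + 1
    return vectors, stages

addends = [1 << k for k in range(8)]          # 8 shifted copies, sum = 255
final, stages = wallace_reduce(addends)
assert len(final) == 2 and sum(final) == 255
assert stages == 4                            # vs. 6 sequential rows in an array
```

Eight addends collapse in four parallel stages, whereas the array scheme of the previous section would need six rows.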
‘1111’, resulting in a delay of 4 stages. An important observation is that multiplying by
‘1111’ is the same as multiplying by ‘10000’ and subtracting the multiplicand from the result; multiplying by a power of two is simply a shift, so this costs two stages. Therefore, we have cut down our worst case from 4 stages of delay to 2 stages of delay.
This type of stage reduction can be generalized into the technique known as Booth recoding. Three bits of the multiplier are used to determine whether a shifted and/or complemented copy of the multiplicand is to be used. Two bits of the multiplicand are multiplexed to create the actual value. The encoding scheme is shown in Table 1, from [33].
Table 1: Booth encoding

    x2i+1  x2i  x2i-1  |  di
      0     0     0    |   0
      0     0     1    |   1
      0     1     0    |   1
      0     1     1    |   2
      1     0     0    |  -2
      1     0     1    |  -1
      1     1     0    |  -1
      1     1     1    |   0
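The digit selection of Table 1 can be sketched as follows; the code is illustrative (unsigned, even-width multiplier), and the names are ours, not the thesis's:

```python
# Sketch of radix-4 Booth digit selection per Table 1: each overlapping
# triple (x_{2i+1}, x_{2i}, x_{2i-1}) of multiplier bits yields a digit
# d_i in {-2,-1,0,1,2} with sum(d_i * 4^i) equal to the multiplier.

BOOTH = {                                     # (x2i+1, x2i, x2i-1) -> di
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_digits(x, n):
    """Recode an n-bit unsigned multiplier (n even) into radix-4 digits."""
    bits = [(x >> k) & 1 for k in range(n)] + [0, 0]   # zero-extend
    digits = []
    for i in range(0, n + 2, 2):
        prev = bits[i - 1] if i else 0        # implicit x_{-1} = 0
        digits.append(BOOTH[(bits[i + 1], bits[i], prev)])
    return digits

for x in range(16):                           # every 4-bit multiplier round-trips
    assert sum(d * 4 ** i for i, d in enumerate(booth_digits(x, 4))) == x
```

Note that the worst case ‘1111’ recodes to the digits (-1, 0, 1), i.e., 16 - 1, reproducing the shift-and-subtract observation above.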
The connections of the Booth multiplexors are shown in Fig. 6.
The net result is that the size of the PPA is reduced, since fewer shifted copies of the multiplicand are necessary in the partial product array. The number of these multiplicand ‘copies’ needed in the partial product array depends on the degree to which Booth recoding is applied. Generally, each level of recoding cuts the number of partial product bits in half. There is additional circuitry involved in performing the recoding however, so this optimization entails inserting complicated logic which itself adds delay and power consumption. In practice, Booth recoding is applied to one level (called radix-4) or two levels, but hardly ever more
Figure 6. Booth recoding (radix 4).
than two levels.
Booth recoding was initially applied to array multipliers, where a reduction in the size of the PPA by a factor of two means reducing the delay of the partial product reduction stage by a factor
of two. Note, however, that cutting the size of the PPA in half has less of an impact on the
Wallace tree scheme, since Wallace trees reduce the size of the PPA at a logarithmic rate; the
savings from Booth recoding yield a reduction of 1-2 levels of logic. Furthermore, there is
logic involved in the actual Booth recoding process. Therefore, it is unclear whether there is
an advantage in applying Booth recoding to Wallace trees. Doubts have been expressed
about the validity of Booth recoding for Wallace trees even for 64-bit x 64-bit multipli-
ers[28].
2.1.7 Final Adder
A common method for achieving low delay in multipliers is to speed up the final addi-
tion stage. Several optimizations exist for performing high-speed addition, as summarized in
[31]. The straightforward application of these designs to multipliers has resulted in various
designs with high speed or low power [37]. In our analyses, we considered carry ripple,
carry skip, and carry select adder structures for this final addition step.
Ripple Adder
As noted previously, the ripple adder is the slowest yet lowest power adder implementation. The ripple adder is of great interest because higher speed adders often incorporate the ripple structure into sub-blocks of the greater high speed adder structure. Therefore the speed and power properties of the ripple adder determine the overall performance of a wide range of adder designs. The structure of ripple addition is shown below in Fig. 7. As can be seen, the delay is linear in the number of addition stages.
Carry-Skip Adders
Carry-skip adders are based on the principle that only under certain conditions can a bit ripple all the way through an adder, from a low bit position to a high bit position. Assuming a carry-in of ‘1’, all pairs of bits to be added must contain at least one bit whose value is ‘1’; if this condition is not met, the rippling action stops. If the condition holds, any carry-in to a rippling block will be propagated to the carry-out, and the output will be ‘1’. In this case, the carry-in ‘skips’ the rippling block. For the ripple adder above, the skip condition is: (a0 + b0)(a1 + b1)(a2 + b2)(a3 + b3).
Figure 7. Ripple adder
A carry-skip adder is shown below in Fig. 8. The add operation is divided into ‘skip’ blocks, and the computation is performed for each block. A conventional ripple adder is included with the skip block to form the final adder. The delay of this block depends on the size of the internal skip blocks as well as the delay for the skip condition calculation; some research has been done on determining the optimal arrangement of carry-skip adders [33].
Ultimately, carry-skip adders are best used in technologies where rippling is very fast.
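The skip condition quoted above can be sketched on a 4-bit block; in the example every column can propagate but none generates a carry, so the carry-in passes straight through. The helper names are ours, not the thesis's:

```python
# Sketch of the carry-skip condition on a 4-bit block. Below, every column
# satisfies a_i OR b_i without any column generating (a_i AND b_i), so the
# carry-in ripples straight through to the carry-out.

def ripple_carry_out(a_bits, b_bits, cin):
    c = cin
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | (a & c) | (b & c)       # full-adder carry chain
    return c

def skip_condition(a_bits, b_bits):
    """(a0 + b0)(a1 + b1)(a2 + b2)(a3 + b3): every column may propagate."""
    return all(a | b for a, b in zip(a_bits, b_bits))

a, b = [1, 0, 1, 0], [0, 1, 0, 1]             # each pair holds exactly one '1'
assert skip_condition(a, b)
for cin in (0, 1):                            # carry-in 'skips' the block
    assert ripple_carry_out(a, b, cin) == cin
```

When the condition holds, the block-level carry-out can be taken from the fast skip path rather than waiting for the internal ripple chain.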
Figure 8. Carry Skip Adders: (a) One carry-skip block (b) 12-bit adder.
Carry-Select Adder
The carry select adder optimization is based on parallel computation: for a sub-block of bits to be added, two ripple adders are instantiated, one which assumes a carry-in of ‘0’ at the lowest bit position, while the other assumes a carry-in of ‘1’. When the actual value of the carry-in to the block is known, the correct ripple adder is selected via a MUX (Fig. 9a). A plot of the delay by bit position is shown in Fig. 9b.
Figure 9. Carry Select Adders: (a) Basic structure of adder, (b) delay chart, and (c) construction for minimum delay.
The carry select adder speeds up addition by implementing blocks of ripple adders
which operate in parallel. The overall final result comes from a chain of MUX elements
which choose the correct sequence of input carries. Although carry select adders achieve delay reductions, they tend to be higher in power due to the greater amount of circuitry needed
to calculate the addition. A low delay implementation of the carry-select adder can be con-
structed as follows. Starting with a basic ripple adder of two bits at the lowest bit position
(bits 0 and 1), a carry select block is constructed at bit positions 2 and 3. MUX elements
have approximately the same delay as a stage of the carry chain. Therefore, the next carry-
select block will be of length 3, at bit positions 4, 5, 6. Carry select blocks therefore start at
bit positions 2, 4, 7, 11, and so on. This is shown conceptually in Fig. 9c. In this formulation,
the delay of the adder will be proportional to the number of carry-select blocks.
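The minimum-delay block sizing just described can be sketched as follows; the function name is ours, not the thesis's:

```python
# Sketch of the minimum-delay carry-select sizing described above: after a
# 2-bit ripple block at positions 0-1, each successive select block is one
# bit longer than the last, since each MUX in the select chain buys time
# for one more ripple stage.

def select_block_starts(width):
    """Starting bit positions of the carry-select blocks for an adder."""
    starts, pos, length = [], 2, 2            # first select block: bits 2-3
    while pos < width:
        starts.append(pos)
        pos += length
        length += 1                           # next block is one bit longer
    return starts

assert select_block_starts(16) == [2, 4, 7, 11]
```

The number of blocks, and hence the delay, grows only as roughly the square root of the adder width.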
A final type of adder is the Carry-Lookahead Adder, which is one of the fastest tech-
niques to perform addition. A brief discussion can be found in [31].
Adder Comparisons
Adder implementation characteristics are summarized in Table 2, taken from [31]. There
is an area-delay trade-off between adders which is partly shown in this chart. Although the
asymptotic delays of carry skip adder and carry select adders are similar, the carry select
tends to be faster but larger than the carry skip adder, while the carry skip adder tends to be
slower but smaller than the carry select adder. This fits the size and delay trends when com-
pared to the ripple adder.
2.1.8 Summary
Since multiplication is a very complex operation, a number of optimizations have been devised to reduce its delay. The multiplier can be sped up by reducing the delay in several
of its component blocks. Recent considerations of power dissipation may bias which ways the
multiplier delay is reduced.
2.2 CMOS Power
The implementation of a multiplier in CMOS digital logic involves various trade-offs
which are particular to the technology. For example, a bipolar implementation would dictate
minimizing circuitry due to the static power dissipation component inherent in bipolar gates.
CMOS, on the other hand, does not suffer from as significant a static power dissipation component, and thus
is more amenable to adding devices.
Table 2: Asymptotic time and space characteristics

                  Time     Space
    Ripple        O(n)     O(n)
    Carry skip    O(√n)    O(n)
    Carry select  O(√n)    O(n)
2.2.1 Static vs. Dynamic Power
There are two modes of power dissipation in integrated circuits: power generated during
static operation versus that generated during dynamic operation. Static power dissipation is
a function of all currents which flow when no switching is occurring. These include currents
due to pn-junctions, static current due to biasing of devices, and leakage currents. Dynamic
power is a result of switching activity, when currents cause capacitances to charge or dis-
charge while performing logic operations. Dynamic current can be large or small depending
on desired delays and capacitances present.
Static CMOS has become the logic family of choice for digital circuit implementations
for several reasons, one being desirable power dissipation characteristics. Due to the com-
plementary nature of the PMOS and NMOS devices, CMOS has low static power dissipa-
tion as one device is generally off when the other is on. Furthermore, there is no D.C. input
current, meaning that very little current flows when the device is not switching. Dynamic
power dissipation in CMOS can be described by the equation in Fig. 10.
Static power dissipation in CMOS is due to leakage currents. Reverse biased pn junctions
form a part of this current, although their contribution is relatively small. More problematic is
subthreshold leakage, where current flows across a transistor when it is nominally "off". For
example, if an NMOS device has 3.3V on its drain, and 0V on its gate and source, ideally no
current flows. In reality a very small current is present, which is an exponential function of the
gate-to-source voltage. While this current is very small (~10 picoAmperes for an NMOS
device in a .35µm process), process shrinks have shown that this current has been increasing
in successive generations (for the same family of processes as above, subthreshold leakage is
~1nAmp at .25µm). Considering the large number of devices on a die, this effect may contrib-
ute substantially to power dissipation for future processes.
Dynamic power dissipation is a function of the behavior of CMOS during logic operations. Two primary currents are present, one which goes to charge capacitances of devices at the output of a gate, and a second 'parasitic' short circuit current. During part of the switching time, when the input voltage is between the supply rails, both PMOS and NMOS devices can conduct. This results in a current flowing directly from the supply voltage rail to ground, called "short circuit current", "crowbar current" or "totem-pole current". The actual value of the current is a function of the conductances of the PMOS and NMOS transistors, as well as the slope of the input signal. In a well-designed circuit, this should contribute < 10% of overall power [6].

P = αCVdd² + K·I·Vdd

α — activity factor: the number of transitions (per operation, e.g., in one cycle)
C — switched capacitance
Vdd — supply voltage
K·I — static current component; K ~ W/L of the PMOS and NMOS devices and the input signal slope

Figure 10. Equation for CMOS power consumption

Figure 11. Short circuit current occurs when CMOS devices switch. If the input of the gate is in the region where the PMOS and NMOS devices are both on, current Ishort will flow directly between the Vdd and Gnd rails. Short circuit current can occur if the gate input is not held close to the Vdd or Gnd rails.

The main source of power dissipation in CMOS, that used in the charging and discharging of capacitances, accounts for 70-90% of overall power dissipation and is therefore the main target of power optimization.
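As an illustrative sketch (not part of the dissertation), the Fig. 10 relation can be evaluated numerically. The parameter values below are invented for illustration, and the dynamic term is multiplied by an assumed clock frequency so that both terms come out in watts:

```python
def cmos_power(alpha, c_switched, vdd, f_clk, i_static):
    """Evaluate the Fig. 10 power relation (toy numbers, not process data).

    alpha      - activity factor (average transitions per cycle)
    c_switched - switched capacitance in farads
    vdd        - supply voltage in volts
    f_clk      - assumed clock frequency in hertz, used to turn the
                 per-cycle energy alpha*C*Vdd^2 into a power
    i_static   - static (leakage/bias) current in amperes, the K*I term
    """
    p_dynamic = alpha * c_switched * vdd ** 2 * f_clk  # alpha*C*Vdd^2 term
    p_static = i_static * vdd                          # K*I*Vdd term
    return p_dynamic + p_static

# Illustrative numbers only: 50 pF switched at alpha = 0.2, 3.3 V, 100 MHz,
# with 1 uA of static current.
p = cmos_power(alpha=0.2, c_switched=50e-12, vdd=3.3, f_clk=100e6, i_static=1e-6)
```

With these invented numbers the dynamic term (about 10.9 mW) dwarfs the static term (about 3.3 µW), consistent with the text's observation that charging and discharging of capacitances dominates overall power.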
2.2.2 Power Optimization Fundamentals
To understand the prospects for power reduction in multipliers, we will describe
approaches which can be taken to reduce power dissipation in CMOS logic structures. Quite a
number of techniques are applicable to multipliers, although their practicability limits their
use in industry.
Three factors are relevant in dynamic power calculations: supply voltage, capacitive load
and activity factor. The first is the voltage to which capacitors are charged, typically the sup-
ply voltage Vdd. Methods which focus on the second, capacitance reduction, are popular as
lowering capacitance benefits both power and delay. The third target is activity reduction,
which stresses minimizing the number of times a node switches in a given period of time.
Careful design of delays in circuits is the primary method for reducing activity.
Voltage
Downward voltage scaling typically accompanies process shrinks and has a great effect in
reducing power dissipation, since power is a function of the square of the voltage
(P ~ αCVdd2). Techniques which further lower voltages have been successful in minimizing
power dissipation, but voltage reduction can result in a corresponding increase in delay. Delay is inversely proportional to the difference between supply and threshold voltage (D ~ 1/[Vdd - Vt]), so a decrease in the supply voltage will give a lower power, but slower, circuit—this can be compensated by a reduction in the threshold voltage, Vt. However, lowering the threshold voltage causes an increase in power (both static and dynamic), increases current leakage, and compromises noise margins. Where to set Vdd and Vt in a given process continues to be a subject of investigation.

Several methods have been proposed for varying Vt during operation through the use of the "backgate effect", also known as the "body effect". By putting a negative voltage on the substrate (Vsb > 0), the effective threshold can be raised, thereby increasing delay and lowering power consumption (see Fig. 12).
Figure 12. Biasing for Vt modification via the backgate effect (bias voltage Vsb applied between source and substrate)
Designs are based on a low Vt, low delay and high power implementation; when one
wishes the circuit to go into "sleep mode", the backgate voltage is lowered and the circuit
becomes low power.
Unfortunately, the above technique has two problems. The amount by which Vt is raised is a function of the square root of Vsb, that is: Vt ≈ Vt0 + γ√Vsb. Therefore, if one wishes to strongly turn off the device, there is a trend of diminishing returns. A second problem relates to process scaling: as transistor dimensions have gone down, the γ parameter in the above equation also scales down. In successive process generations, we see a reduction in the ability to control the threshold voltage via the body effect.
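The diminishing-returns behavior follows directly from the square-root relation above. A toy calculation with made-up Vt0 and γ values (not process data) makes it visible:

```python
import math

def vt_body(vt0, gamma, vsb):
    # Threshold shift grows only with the square root of the body bias Vsb,
    # per Vt ~ Vt0 + gamma * sqrt(Vsb).
    return vt0 + gamma * math.sqrt(vsb)

# Made-up parameters: Vt0 = 0.5 V, gamma = 0.4 V^(1/2).
# Incremental Vt gain for each additional volt of Vsb:
shifts = [vt_body(0.5, 0.4, v) - vt_body(0.5, 0.4, v - 1) for v in (1, 2, 3, 4)]
# Each extra volt of body bias buys a smaller Vt increase than the one before.
```

The first volt of Vsb raises Vt by γ, the second by only γ(√2 − 1), and so on — the diminishing returns the text describes.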
Similar methods for varying the voltage of the supply rails have been proposed and implemented. The simplest of these involves a servo control on the value of the voltage rails, adjusting the value appropriately based on whether the circuit has to have low delay or low power. Other techniques, such as having portions of the circuit use different sets of power rails, have shown success [18].
Capacitance
As mentioned previously, capacitance reduction has been a primary goal of circuit design-
ers since lowering capacitance also contributes to delay reduction. For power minimization, it
is important to consider capacitance and activity factor together. For example, a high capacitance section of circuit which does not switch very often may contribute less power than a low capacitance point that experiences high switching activity.

Capacitance in integrated circuits can be classified into three groups—these are gate capacitance (inputs of devices), parasitic capacitances (internal nodes of gates) and interconnect capacitance. The basic approach for capacitance minimization in cell-based design is to create a library of various sized cells, minimizing parasitic capacitances during layout through careful physical design. Gate capacitance is a fixed function of the desired driving strengths of each cell. During system level place and route, interconnect capacitance is minimized by placing gates close together and avoiding high amounts of coupling between signal lines.

When using a standard cell gate library, it is important to note that not all inputs to a gate which have the same function (e.g., NAND inputs, NOR inputs) exhibit the same delay and power characteristics. The variance in delay as a function of the inputs is well known, but
there is also an effect on power. By having certain signals arrive later than others, it is possible
to minimize power dissipation caused by inadvertent charging and discharging of parasitic
capacitances.
To demonstrate this parasitic charging effect, consider the 3-input NAND gate, shown in
Fig. 13. If one models the pulldown devices using a resistor to represent the on-resistance of
the NMOS devices, with parasitic capacitance at the drain and source of each device, we see
three points of capacitance, one at the output and two within the pulldown chain.
The charging behavior of each point depends on the arrival times of the input signals.
Consider two cases where the inputs switch from 000 to 111. In both cases, the output goes
from high to low, but in one case the parasitic capacitances are charged up prior to being dis-
Figure 13. Pulldown stack of 3-input NAND (inputs X, Y, Z in series between the output and Gnd)
charged, whereas very little charging of parasitics occurs in the second case (see Fig. 14). Such input ordering can result in power savings of 10-20%, depending on the values of the capacitances [13].
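The input-order effect can be illustrated with a toy model of the pulldown stack (a sketch with invented structure, not a circuit simulation). Internal node n1 sits between X and Y, n2 between Y and Z, and a node "charges" when on-devices connect it to the high output while the path to ground is blocked:

```python
def internal_charge_events(arrival_order):
    """Toy model of the 3-input NAND pulldown stack: X (top), Y, Z (bottom).

    Inputs rise one at a time in arrival_order while the output is high.
    Returns the number of internal-node charging events, each of which
    costs energy that is later thrown away on the final discharge.
    """
    on = {"X": False, "Y": False, "Z": False}
    node_high = {"n1": False, "n2": False}
    events = 0
    for sig in arrival_order:
        on[sig] = True
        # n1 charges if X conducts (path up to the output) and Y blocks.
        if on["X"] and not on["Y"] and not node_high["n1"]:
            node_high["n1"] = True
            events += 1
        # n2 charges if X and Y conduct and Z blocks.
        if on["X"] and on["Y"] and not on["Z"] and not node_high["n2"]:
            node_high["n2"] = True
            events += 1
    return events
```

With arrival order X, Y, Z, both internal nodes are charged before the final discharge; with order Z, Y, X, neither node ever charges — the two cases of Fig. 14.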
Activity
A large amount of power is dissipated when a node switches several times prior to settling to its final value. These extra transitions are due to unequal arrival times at the inputs of devices, which cause several state changes before the output settles.
Figure 14. Switching behavior for different input arrival times
In Fig. 15, the bottom input of the AND gate is delayed by the inverter. Although the final
value of the output is the same as the initial value, the gate output may undergo a spurious
transition, due to the delay introduced by the inverter. Such a transition is called a "glitch" or
we say the output node experiences "false switching."
Glitch minimization focuses on attempting to equalize signal arrival times at the inputs of
a gate. There are several techniques which can be applied to achieve this result. For example,
buffers can be inserted which introduce delays in fast signal paths. This causes signals travel-
ling along these paths to slow down, allowing inputs to transition at similar time points.
Another technique is to introduce a row of latches, which are then triggered at the same time,
thereby "filtering" out fast signals. Otherwise, the logic may be resynthesized, attempting to
generate paths whose delays are more "balanced" in terms of arrival times at the inputs of
gates. Note that some of these techniques modify the logic by adding elements (buffers,
latches), and one must be careful that the extra power of these devices is compensated by a
Figure 15. Output glitching.
48 Analysis and Design of Low Power Multipliers
Multiplier Power Reduction
larger amount of overall power savings.
Design for path balancing is a difficult goal due to conflicting effects when gates are
sized. Given a logic structure, one wishes to size the gates to minimize power, subject to a
delay constraint. Power can be reduced by reducing the size of gates, thereby reducing
switched gate capacitance. However, path balancing attempts to increase delay on fast paths
by reducing gate capacitance, causing risetimes and falltimes to be much longer, corre-
sponding to a slow signal slope. The problem lies in increased short circuit current as the
signal slope is reduced; both NMOS and PMOS are turned on for a longer time. Trading off
these effects is a very difficult optimization problem. Some solutions have been proposed,
for example starting from a design with gates of minimum size, and upsizing the gates based
on a static timing analysis until delay constraints are met[49]. This problem continues to be
a focus of power optimization at the transistor level.
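One proposed flow [49] starts from minimum-sized gates and upsizes until timing is met. A minimal sketch of such a greedy loop, using an invented load/size delay model in place of a real static timing analysis:

```python
def size_for_delay(loads, target_delay, step=1.25):
    """Greedy sizing sketch: all gates start at minimum size (1.0) and the
    slowest stage is repeatedly upsized until the path meets target_delay.

    Stage delay is modeled crudely as load/size; total size is a proxy for
    switched capacitance and hence power.
    """
    sizes = [1.0] * len(loads)

    def delays():
        return [l / s for l, s in zip(loads, sizes)]

    while sum(delays()) > target_delay:
        # Upsize the critical stage; this costs gate capacitance (power).
        worst = max(range(len(loads)), key=lambda i: delays()[i])
        sizes[worst] *= step
    return sizes

# Three-stage path with (invented) unit-size delays of 4, 2 and 1;
# size it to meet a path delay of 3.
sizes = size_for_delay([4.0, 2.0, 1.0], target_delay=3.0)
```

The loop captures the tension described above: meeting delay forces sizes (and therefore capacitance) up, while power pulls them toward minimum.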
2.3 Multiplier Power Reduction
The design of digital CMOS has historically focused on delay reduction; only recently has power dissipation gained prominence. In multipliers, delay increases as the
size of the multiplier grows (in terms of bits, e.g., a 16-bit by 16-bit multiplier) but can vary
depending on implementation. Power is proportional to the amount of circuitry present in
the multiplier and how that circuitry is connected to perform the multiplication. The number of adder blocks comprising a multiplier grows as the square, O(n²), of the size of the multiplication. Therefore, multipliers tend to be fairly large, power-consuming blocks.
To a first order, both power and delay can be minimized by using the smallest multiplier
design available. Therefore, while microprocessors use 53- and 64-bit multipliers in accordance with the IEEE standard, DSP multipliers have sizes in the range of 8, 16, or 24 bits. Often, the multiplier will perform a multiplication on two numbers of a certain resolution (for example, multiply two 16-bit numbers), but will incorporate the resulting value in a final add of greater resolution (e.g., 24 bits). This is a consequence of the carry-save methodology prevalent in
most methods for performing partial product reduction, as well as the need in many DSP algo-
rithms to perform numerous sequential multiply operations which are accumulated together.
Typically, power and delay minimization techniques focus on the various sub-blocks com-
prising a larger block and address power optimization in each of these independently. How-
ever, dependencies between these sub-blocks affect the overall power characteristics, and
some benefit can be gained from an integrated approach. We will initially consider basic
blocks in a multiplier separately, then describe how interdependencies affect overall opera-
tion. (An example of a power/delay reduction technique which lends itself very well to the
interaction of various sub-blocks is presented in the final chapter.)
We can divide power analysis and optimization of multipliers along the lines of the
design hierarchy. We initially focus on circuitry used to implement the logic functions, the
design of the logic functions comprising the multiplier, and the architecture of the multiplier
as a whole.
2.3.1 Multiplier Circuitry
The basic circuitry used to implement the multiplier is defined according to process
technology. Currently, the vast majority of digital logic is implemented in standard CMOS.
Aggressively low power designs attempt to use adiabatic techniques[13] for low power
operation, although these compromise delay to achieve the power gain. Low swing logic is a
technique which can be usefully applied to general CMOS circuits[17][18], and has been
applied to multipliers in particular. Although a wide range of circuit techniques exist for
implementing fast and/or lower power arithmetic components [37], standard CMOS contin-
ues to be the circuit of choice[39]. Very aggressive high speed designs use dynamic
CMOS[34][36], which unfortunately is also power-intensive. Our focus will be on circuits
designed in standard static CMOS.
2.3.2 Logic Level Multiplier Optimization
Since the components comprising a multiplier (PP generation, PP reduction, final adder)
have been fairly standard, and as this decomposition is recognized as one of the best, if not the best, ways to implement high speed multipliers, a good deal of effort has been spent on optimizing the power of these components. We will describe some of these investigations.

Multipliers can be constructed which take advantage of special characteristics of the numbers which they are multiplying. For example, it is observed in [41] that in FIR filters, the coefficients used in multiplications do not change values. It is empirically observed that Booth recoded multipliers can be implemented to be low power if the coefficients are used for the multiplier (which is then encoded), as opposed to the data inputs. In this case, a lower number of transitions results, due to the different characteristics of the input. Similarly, transition reduction can be achieved in 4-2 reduction trees by noting that two outputs have the same 'weight' [40]. Therefore, a degree of freedom exists in circuit implementation, which is important since one of the outputs typically has a greater output load. It is possible to assign a less frequently transitioning signal to the more heavily loaded output to achieve power savings. Different circuit arrangements are proposed which attempt to minimize output transition probabilities and their propagation through the circuit.

Booth recoding suffers from the problem that unequal delay paths exist in the Booth partial product generator. One path goes from the multiplier, through the Booth encoder, and then to the Booth decoder, while paths from the multiplicand go directly to the Booth decoder. Since the Booth decoder is typically two gates deep, a glitch can result at the output of the
Booth decoder due to this greater delay path (see Fig. 5). One approach is to redesign the Booth encoders/decoders[42] in the following manner: the Booth encoder's logic depth is reduced and the Booth decoder is designed such that early arriving signals (from the multiplicand) have a greater delay "through" the decoder than inputs from the Booth encoder. This balances the signal paths and allows reduced glitching. Unfortunately, the above study does not present data confirming these theoretical results.

The presence of spurious signal transitions is the object of much study. For example, the characteristics of array partial product reduction schemes which lead to high switching activity are addressed in [43], which attempts to create a more regular generation of partial product bits when they are needed. This work is studied more closely in Chapter 4.

Latching and gating of signals has been studied in various contexts, with successful application to multipliers. Certain inputs have special characteristics which can be leveraged to avoid unnecessary calculations. For example, a simple case is that if one of the inputs to a multiplier is '0', the output is obviously '0'. Such pre-computations can be used to stop the multiplier from calculating, through the use of latches on the inputs—if the output is useless, the input latches are not made 'transparent' for these inputs, and the next set of inputs is considered[21]. Furthermore, the case where a bit vector of all '0's is added to the partial result during carry-save addition implies that useless additions are being performed. Work
on bypassing carry-save adders is presented in [44], which shows how power savings can be achieved through '0'-aware circuitry. Latching at a circuit level has also been explored[45], and we further describe these developments in Chapter 4.

2.3.3 Architectural Power Optimization
A good deal of work has been done in analyzing architectural choices in multiplier design, at several levels. At the basic level, the architecture of the multiplier can be modified in several ways, which is one of the foci of this research. The choice of arrays versus Wallace trees has primarily been based on their respective delay properties, while the relative power dissipation merits of each implementation are unclear[37][38]. We further describe these issues in Chapter 3.

The multiplier is typically a component of a larger system, which itself is assembled to implement a particular chip architecture. The interaction of the multiplier with its surroundings impacts the power of the entire system, i.e., the system impacts the power of the multiplier by supplying the inputs to the multiplier, and the multiplier provides outputs which are read by the remainder of the system.

An example of work that considers these issues is [19], which investigates the frequency of transitions in high versus low order bits of a multiplier. Such information can lead to using smaller multipliers when necessary and only doing full-precision multiplication when the
need is detected[20]. Another approach at the architectural level is to consider to what
extent operations can be parallelized. If a number of operations are independent of each
other and can be performed in parallel, multiple hardware can be implemented on-chip, to
be run at reduced voltage. Speed is nearly linearly proportional to voltage, but power is pro-
portional to the square of voltage. Therefore, if the increased number of functional units can
account for the speed reduction, a savings in power can be achieved[6]. Architecture-specific optimizations can achieve significant gains, at the cost of compromising generality. This thesis focuses on techniques which have wide applicability.
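The parallelism argument can be made concrete with the first-order models used earlier in this chapter (D ~ 1/[Vdd − Vt], P ~ αCVdd²). The voltage numbers below are illustrative, and the overhead of the duplicated hardware is ignored:

```python
def scaled_vdd_for_half_speed(vdd, vt):
    # D ~ 1/(Vdd - Vt): allowing twice the delay means halving (Vdd - Vt).
    return vt + (vdd - vt) / 2.0

def parallel_power_ratio(vdd, vt):
    """Power of two half-speed units vs. one full-speed unit.

    Total switched capacitance doubles, but each copy runs at half the
    rate, so the alpha*C*f product is unchanged; only the Vdd^2 factor
    drops with the lowered supply.
    """
    vdd_low = scaled_vdd_for_half_speed(vdd, vt)
    return (vdd_low / vdd) ** 2

# Illustrative: 3.3 V supply, 0.5 V threshold.
ratio = parallel_power_ratio(vdd=3.3, vt=0.5)
```

With these numbers the supply drops to 1.9 V and total power falls to roughly a third of the original, at the same throughput — the trade described in [6], before accounting for duplication overhead.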
2.3.4 Low Power Multiplication Research
The above work shows some of the areas in which low power multiplication techniques
have been developed. A basic characteristic about multiplication circuitry is that the primary
goal is to perform the operation with high speed. Multiplication variants have been devel-
oped with the express purpose of reducing the delay required to get a result. Therefore, our
main focus was to analyze these low-delay blocks.
Initially, the choice of Wallace tree vs. array partial product reduction was based on
delay characteristics, as explained above. However, it is also possible to achieve delay
reduction through high speed adders. If an overall multiplier target can be met through an
aggressive adder design, perhaps it is not necessary to use a Wallace tree, since the design
effort is greater for this type of circuit. Furthermore, it has been suggested that Wallace trees
should be avoided for low power applications [38]. This view has been contradicted by some
early experiments which we performed, indicating that Wallace trees have much lower power
dissipation while retaining their delay advantages. Such a point of view is confirmed by sim-
ple experiments in the literature[37]. Therefore, it might be more practical to use a low speed,
low power adder, in conjunction with a Wallace tree, to meet delay targets. In these studies, an
important consideration which was largely ignored was the impact of physical design on
power and delay. Our initial work attempted to resolve these ambiguities by instantiating a
series of Wallace tree and array designs with different adders, using a simple design flow
which accounted for layout characteristics (placement and interconnect capacitance). This
work is described in Chapter 3.
The presence of spurious transitions in multipliers suggests the use of latches to reduce
switching activity. Some work in this area has been done[43][45], although it focuses on array
multipliers, which are less interesting due to their high-delay properties. We attempt to deter-
mine whether it is possible to apply latching to Wallace trees, whose lower logic depth leads
to lower delay. Several challenges exist in this area, such as very wide signal paths and fewer spurious transitions to 'reduce', i.e., the design is initially quite low in power. The results of
our investigation are shown in Chapter 4.
Circuit design in multipliers has often been targeted at optimizing delay. The circuit design techniques described above attempt to reduce power while maintaining low delay—they are successful to varying degrees. We investigated a delay reduction technique in adders, which attempts to reduce the delay by lowering the logic depth. By lowering the logic depth, we reduced the amount of circuitry present in the adders. If one can reduce circuit count, it is possible that power will be reduced, since the total capacitance of the circuit implementation has been lowered. To assess the validity of this optimization, we instantiated adders using a library of different reduced-circuit logic blocks. A full analysis including layout and detailed SPICE simulations shows that this technique is viable for lower power multiplication. This work is described in Chapter 5.

2.4 Summary
In this chapter, we present an overview of multiplier operation, as well as the basic power dissipating characteristics in CMOS. While false transitions and parasitic currents seem unavoidable in CMOS implementations of multipliers, our goal is to minimize these effects while performing the operation. Design techniques have expressly focused on power reduction in recent years, although the larger goal is to achieve power efficiency without compromising delay, which is much more difficult. In subsequent chapters, we describe
analysis and design of multipliers for low power, using multiplier designs to reinforce our
conclusions. Particular attention is paid to physical design characteristics, which can affect
choices made at the logic level.
3 Power Trade-offs in Array and Wallace Tree Multipliers
3.1 Introduction
This chapter describes investigations into understanding basic power dissipation charac-
teristics of partial product reduction schemes. We attempted to understand the switching
characteristics of arrays and Wallace trees and how this switching leads to different levels of
power dissipation. Determining the answer required modelling not only the logical behavior
of each style but also creating designs with a rough representation of the physical design
characteristics of CMOS implementation, interconnect capacitance in particular. We arrived
at results which refuted earlier crude analyses of these structures. Results were published in
[46].
3.1.1 Partial Product Reduction Schemes
As discussed in Chapter 2, two competing schemes are commonly used to perform the partial product reduction step: arrays and Wallace trees. In the array structure, rows of partial products are added incrementally, resulting in n total adds of n bits wide each. The final fast add is also of size n. In the Wallace tree scheme, rows are added in parallel, so that at each 'level' in the process, m rows are reduced to (2/3)m rows. The total number of logic levels for partial product reduction is log_{3/2} n. However, the final add is approximately of size 2n − log_{3/2} n.

The array method is very popular as it lends itself to a clean VLSI implementation. The structure can be laid out in a regular array, where components communicate with other blocks which are placed at adjacent locations, resulting in very few long wires. However, array designs tend to be slower than those using the Wallace tree method. This is due to the deeper logic depth of arrays, which sets a lower limit on the delay of the circuit, even in the presence of gate upsizing. On the other hand, the Wallace tree logic network is highly irregular, necessitating a custom placement and routing phase. Furthermore, physical design of the Wallace tree can create long wires whose additional capacitance causes greater wire load. Despite these drawbacks, the Wallace tree structure is the partial product reduction method of choice in high-performance designs [28]. While it is generally accepted that the Wallace tree is faster, it was not entirely clear which of these two designs is lower in power.
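The level count can be checked with a short sketch. It applies the exact full-adder grouping (each group of three rows compresses to two, so m rows become roughly (2/3)m) rather than the continuous approximation:

```python
def wallace_levels(rows):
    """Count reduction levels needed to bring `rows` partial-product rows
    down to the final two, using 3:2 compression at each level."""
    levels = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3  # ~ (2/3) * rows per level
        levels += 1
    return levels
```

For example, 16 partial-product rows reduce in 6 levels (16 → 11 → 8 → 6 → 4 → 3 → 2), versus a linear number of adder rows in the array arrangement — the log-versus-linear gap discussed above.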
As we began this study, we found some disagreement among the few sources who speculated on the power characteristics of arrays versus Wallace trees. Bellaouar and Elmasry [38] suggest that Wallace tree styles are best avoided for low power applications, since the excess wiring was likely to consume extra power. On the other hand, Callaway and Swartzlander [37] demonstrated quantitatively that switching activity within just the partial product reduction hardware was substantially better for the tree over the array—if one ignores the wires completely. Montoye [30] alluded to the difficulties in dealing with Wallace tree wiring for one high-performance commercial processor (IBM RS/6000), but concluded that the speed gain is worth the trouble. Due to the complexity of the wiring problem, this design used counters larger than the basic CSA (3-input, 2-output) to reduce the amount of interconnect. Unfortunately, power was not a concern in this design.

A low power implementation of Wallace trees might be constructed using minimum sized devices (to reduce gate capacitance), although speed would dictate larger devices, due to the extra capacitance introduced by long wires (when contrasted to array multipliers). In this case, not only is power an issue, but we have to be aware of interconnect effects on the overall delay.

3.1.2 Analysis of Switching Behavior
The main difference between array and Wallace tree structures is the method in which
data is processed. Array structures incorporate carry-save additions in a sequential manner,
whereas Wallace trees perform carry-save addition in a parallel manner. In both cases, the
number of adders is roughly equal, as the bit compression procedure using full adders (3 input
bits, 2 output bits) does not vary whether one performs the procedure sequentially or in paral-
lel.
Although the linear versus logarithmic delay properties of these two styles are clearly
understood (they are a direct function of the logic depth), the reason for different power dissi-
pations is a bit more subtle. Signal flow is described based on the example shown below (Fig.
16).
For this example, we assume simple adder operations (i.e., ignore the carry, etc., assume
simple one-bit operation,) and we assume unit delay through all adders. In each design, one
clearly sees the effect of logic depth on delay. Switching behavior can be deduced from exam-
Figure 16. Sequential vs. parallel layouts for logic blocks. (Sequential: adder1, adder2, adder3 chained; parallel: addera and adderb feed adderc; inputs a, b, c, d in both cases.)
ining when the inputs cause switching to happen in downstream logic.

In the case of the sequential logic, where all inputs arrive at the same time, we see that at time = 0, inputs 'a' and 'b' are added, and this causes the output of adder1 to toggle in the next time frame. At the same time, 'c' is added to the initial output value of adder2 and its output toggles in the next time frame. Similarly with the output of adder3. At time = 1, 'c' is added to the new output of adder1, and the output of adder2 toggles again. Similarly with the output of adder3. Finally at time = 2, the value at the output of adder2 is added to 'd' and the final result is determined.

In the parallel case, at time = 0, the outputs of addera and adderb are calculated. At time = 1, the output of adderc is determined and no more switching occurs.

From a switching perspective, we can see that for sequential arrangements, adders at the nth level of logic switch n times, whereas for parallel operation, every adder switches only once. Therefore, parallel organization of adders has a beneficial effect of reducing switching activity.

Partial product reduction styles are only slightly more complicated since the adder blocks generate both carry and sum bits. These two signals have different delay properties, so that both array and Wallace tree reduction networks experience glitching due to these
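A small unit-delay simulation makes the switching difference measurable. This is a sketch using the same simplified one-bit adders as Fig. 16, modeled here as XORs that ignore carries; the netlists and delay model are invented to mirror the figure:

```python
import random

def count_toggles(netlist, old_in, new_in, steps=8):
    """Unit-delay simulation: each gate output at t+1 is the XOR of its
    input values at t. Returns total gate-output toggles after the inputs
    switch from old_in to new_in."""
    vals = dict(old_in)
    for _ in range(steps):  # settle the network with the old inputs
        vals.update({g: vals.get(a, 0) ^ vals.get(b, 0)
                     for g, (a, b) in netlist.items()})
    vals.update(new_in)     # all inputs change at t = 0
    toggles = 0
    for _ in range(steps):
        nxt = {g: vals[a] ^ vals[b] for g, (a, b) in netlist.items()}
        toggles += sum(nxt[g] != vals[g] for g in netlist)
        vals.update(nxt)
    return toggles

# Netlists from Fig. 16: a sequential chain vs. a parallel tree.
seq = {"adder1": ("a", "b"), "adder2": ("adder1", "c"), "adder3": ("adder2", "d")}
par = {"addera": ("a", "b"), "adderb": ("c", "d"), "adderc": ("addera", "adderb")}

random.seed(0)
seq_total = par_total = 0
for _ in range(1000):
    old = {k: random.randint(0, 1) for k in "abcd"}
    new = {k: random.randint(0, 1) for k in "abcd"}
    seq_total += count_toggles(seq, old, new)
    par_total += count_toggles(par, old, new)
# The chain accumulates extra transitions at deeper levels; the tree does not.
```

Over random vectors the chain's deeper adders toggle repeatedly before settling, so its total transition count comes out well above the tree's, matching the argument above.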
unequal delays. Furthermore, the delay out of any given input is a function of the input vectors; certain input patterns will cause the output to switch faster than other inputs. Finally, the Wallace PP reduction tree is not always as symmetrical as Fig. 16 implies. In some cases, signals 'skip' a level, as shown in Fig. 17.

In conclusion, glitching originates from several sources and even Wallace tree styles experience these spurious transitions. Nevertheless, array partial product reduction structures tend to experience a much greater amount of false switching than Wallace trees. For arrays, glitching due to unequal delay of input signals based on logic depth is the primary consideration.
Figure 17. Signal flow in arrays vs. Wallace trees. (a) In arrays, inputs are present at every level of logic depth, so digital circuits at deeper logic levels experience more switching. However, the carry-save adders are arranged in rows, so signals tend to flow in "waves" down the logic. (b) Wallace trees have inputs at one logic level, so input data arrives in parallel and flows downward. However, some connections "skip" a logic level and so input arrival times tend to be skewed at deeper logic levels.
3.1.3 Initial Investigations
Intuition suggests that the log-versus-linear depth of the reduction network for the Wallace tree might well lead to shorter propagation paths and less power-consuming glitching. We performed a first order analysis of the average transition counts across sets of random vectors applied to both array and Wallace tree designs. This was a Verilog model which used unit delay estimates to model the carry and sum delay differences. We assigned a delay value of '2' to the carry, and a value of '3' for the sum, mimicking the rough delay difference in a typical CMOS implementation. Input capacitances were not incorporated into the analysis. The results are shown in Fig. 18.

Interestingly, a very similar figure appears in [37]. That work presented a detailed analysis of power for various adder forms, and then speculated on power dissipation in multipli-
0
20000
40000
60000
80000
4 6 8 10 12 14 16
Avg
. Num
. Tra
nsiti
ons
(100
tria
ls)
Multiplier Size (bits)
Array Multiplier
Wallace TreeMultiplier
Figure 18. Transition count comparison for multipliers.
Analysis and Design of Low Power Multipliers 65
ers by performing an analysis similar to the one described above. Reflecting our initial
hypothesis, the authors observed that Wallace trees should have a significant power advan-
tage.
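A first-order experiment of this kind can be sketched with a small event-driven simulator. This is an illustrative reconstruction, not the original Verilog model: the TransitionSim class, the wire names, and the three-stage carry-save chain are our own stand-ins, with only the carry and sum delays of 2 and 3 taken from the text.

```python
# Toy event-driven transition counter in the spirit of the unit-delay
# experiment described above.  The netlist below (a 3-deep chain of full
# adders, mimicking the linear signal flow of an array) is illustrative,
# not one of the actual multiplier netlists.
import heapq
from itertools import count

CARRY_DELAY, SUM_DELAY = 2, 3   # rough carry/sum delay ratio from the text

class TransitionSim:
    def __init__(self):
        self.val = {}            # wire name -> current logic value
        self.fanout = {}         # wire name -> gates reading it
        self.toggles = {}        # wire name -> number of transitions seen
        self._tie = count()      # tie-breaker for simultaneous events

    def gate(self, ins, out, func, delay):
        g = (ins, out, func, delay)
        for w in ins:
            self.fanout.setdefault(w, []).append(g)
        self.val.setdefault(out, 0)

    def run(self, stimuli):
        """stimuli: iterable of (time, wire, value) primary-input events."""
        q = [(t, next(self._tie), w, v) for t, w, v in stimuli]
        heapq.heapify(q)
        while q:
            t, _, w, v = heapq.heappop(q)
            if self.val.get(w, 0) == v:
                continue                      # no change, no event
            self.val[w] = v
            self.toggles[w] = self.toggles.get(w, 0) + 1
            for ins, out, func, d in self.fanout.get(w, []):
                nv = func(*(self.val.get(i, 0) for i in ins))
                heapq.heappush(q, (t + d, next(self._tie), out, nv))

xor3 = lambda a, b, c: a ^ b ^ c                    # sum of a full adder
maj = lambda a, b, c: (a & b) | (a & c) | (b & c)   # carry of a full adder

sim = TransitionSim()
sim.gate(("a0", "b0", "c0"), "s0", xor3, SUM_DELAY)
sim.gate(("a0", "b0", "c0"), "co0", maj, CARRY_DELAY)
sim.gate(("s0", "a1", "b1"), "s1", xor3, SUM_DELAY)
sim.gate(("s0", "a1", "b1"), "co1", maj, CARRY_DELAY)
sim.gate(("s1", "a2", "b2"), "s2", xor3, SUM_DELAY)
sim.gate(("s1", "a2", "b2"), "co2", maj, CARRY_DELAY)

# Switch every primary input 0 -> 1 at t = 0 and count all transitions.
inputs = ["a0", "b0", "c0", "a1", "b1", "a2", "b2"]
sim.run([(0, w, 1) for w in inputs])
```

Running this, the sum outputs at deeper logic levels toggle strictly more often than those at shallow levels, which is exactly the depth-dependent glitching the text attributes to array-style reduction networks.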
Unfortunately, these analyses are incomplete due to a lack of consideration of wiring
effects. Since a strong point of arrays for both delay and power is the absence of significant
wiring capacitance, the above graph does not really resolve the debate, and therefore is unable
to support or refute the claim of Bellaouar and Elmasry[38].
To remedy this, we devised a more detailed evaluation model that considers not only the
gate-level differences, but also the wiring effects due to layout. As we shall see, experiments
suggest optimism for the Wallace trees: they are neither as bad as [38] suggests, nor as good as
[37] estimates, but within the limits of our model, the tree style appears somewhat better on
both energy consumption and on speed. We will describe our modeling methodology, the lay-
out strategy for comparing arrays and trees, and our experimental results.
3.2 An Improved Analysis Methodology
Determining the effects of interconnect on delay and power requires a more detailed
model of the logical behavior and physical characteristics of our devices. The logic structure
was refined through the use of SPICE models to derive delay and power numbers at the circuit
level. This gives us detailed data which then can be used in gate-level simulation to deter-
mine more exactly transition behavior. The basic analysis methodology was initially devel-
oped for these experiments, and subsequently refined as the research progressed. To that
end, we will only relate the features which are relevant to this work. The comparison meth-
odology comprised several stages.
3.2.1 Generation of Component Library

Logic gates were represented using SPICE transistor descriptions. We worked from fun-
damental blocks such as full-adders, half-adders and AND gates. This allowed us to lever-
age HSPICE and other simulators to develop characterized data for the delay and power
dissipation of our basic blocks. The number of basic CMOS blocks necessary to implement
a multiplier is fairly small, approximately 5-10, depending on how aggressive a design one
wishes to implement. For this initial work, we simulated only the full adder (CSA), half
adder (HA), AND and XOR gates and inverters. The size of the basic logic blocks was ini-
tially fairly coarse; for example, we represented a 3-input, 2-output full adder in a single
block.
An alternative choice would have been to characterize the smallest CMOS gates possi-
ble. For example, the full adder is composed of a carry stage, a sum stage and two inverters.
The choice of doing basic simulation on complex or simple gates has implications for the
accuracy of our delay and power measurements, and will be discussed in the following sec-
tion.
This initial work was done using parameters from the Hewlett-Packard 0.8µm CMOS pro-
cess (CMOS26G), with the designs assuming a 3V power supply.
3.2.2 Circuitry

The basic building block in all of our experiments is the full adder, which is used in the PP
reduction stage and the final adder designs. Static CMOS logic blocks were based on standard
designs, which can be found in [1]. The most commonly used implementation is shown in Fig.
19. The popularity of this particular design comes from the frugal use of transistors in imple-
menting both the carry function and the exclusive-or (sum) function: 28 transistors are used,
Figure 19. Full adder, also called carry-save adder, implemented using the '28T' construction.
hence the designation "28T" cell. Note that the carry signal is faster than the sum signal, which might initially be counter-intuitive from an arithmetic standpoint.

3.2.3 Cell Characterization

In this initial version of the design flow, our cells were characterized through SPICE simulations of major logic blocks such as full-adders, half-adders and AND gates, with output loads of 0 fF and 100 fF. Delay and power were calculated for all possible input changes, in the worst case requiring O(2^(2n)) simulations for complete characterization. In practice, we only simulated input vectors which caused the output to change. Delays were measured from the input 50% point to the output 50% point. Power dissipation was of two forms: that consumed by the circuit switching, and that consumed as charge was delivered and removed from the output load. Data was not calculated for different input slopes. During logic simulation, delay and power were computed as a simple linear interpolation from the data calculated for the two output loads. The actual output load was a combination of gate load (which could be determined from the interconnect network) and wire load (which required the physical design step.)
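The two-point characterization lookup amounts to a simple linear interpolation. A minimal sketch, in which the function name and the delay numbers are illustrative placeholders rather than actual HSPICE data:

```python
def cell_value(char, load_ff):
    """Interpolate a characterized quantity at the actual output load.

    char maps output load (fF) -> characterized value; exactly two
    points, as in this first version of the flow (0 fF and 100 fF).
    """
    (x0, y0), (x1, y1) = sorted(char.items())
    return y0 + (y1 - y0) * (load_ff - x0) / (x1 - x0)

# Placeholder characterization data (illustrative, not HSPICE results):
# sum-output delay in ns at 0 fF and at 100 fF output load.
sum_delay = {0.0: 0.9, 100.0: 2.1}

# Actual load = gate load (from the netlist) + wire load (from layout).
gate_load, wire_load = 22.0, 15.0   # fF, hypothetical values
delay = cell_value(sum_delay, gate_load + wire_load)
```

With these placeholder numbers the 37 fF load interpolates to 1.344 ns; the same lookup is applied to the energy tables.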
Glitching: Simulators generally determine what change of inputs will cause a change of outputs, and execute characterization runs for these input stimuli. Unfortunately, some inputs generate a single glitch (e.g., the output goes 0->1->0), and therefore a simulator can miss the fact that a glitch may appear on the output and be propagated to downstream logic. In fact, a glitch does occur for a particular set of inputs to the full-adder. However, this glitch is fairly small and generally gets filtered out fairly easily.

Accuracy: Characterizing different-sized blocks of logic using SPICE can be desirable or harmful in several ways. If the basic blocks that we characterize are basic CMOS gates (single-output stacks of transistors), we achieve a high level of detail, but we maximize the number of blocks that we have to deal with in a complete, assembled system. On the other hand, if the basic blocks we characterize are larger than CMOS gates, we can lose some detail in circuit behavior. Furthermore, such blocks generally require a larger input count. Since the number of simulations for block characterization is a function of input count (in our case, an exponential function of the input count), we would like our atomic blocks to be as small as possible.

However, there is an advantage in simulating larger blocks. Generally when glitches occur, it is very difficult to determine the energy dissipated, since node voltages often do not swing completely rail-to-rail. In fact, if the output glitches and does not reach the complementary rail before switching back, an event-driven logic simulator may not count this glitch as an 'event'. When simulating large blocks using a detailed circuit simulator, all internal glitches are correctly accounted for.
Due to these and other considerations, we initially decided on characterizing basic blocks such as full adders and MUXes as single entities. Some of these logic blocks are comprised of several independent CMOS gates (e.g., the full-adder is made up of 4 gates.)

Timing: When characterizing cells, we generally simulate for one or more inputs switching at a particular point in time. It is very hard to account for the effect of inputs switching concurrently, but starting at different times (i.e., one input begins switching, then before it is finished, another input begins switching.) To accurately treat this effect, we would need to run a much larger number of characterizations. We settled for running what we considered an empirically "adequate" number of simulations: all possible input vectors which caused an output to switch.

Characterization for a cell library takes approximately 1.5 hours for our basic cells. As we developed more accurate models, we required a greater number of characterization runs, and this time increased.

3.2.4 Partial Product Reduction Generators

Using full-adders and half-adders, we constructed the partial product reduction stage for both arrays and Wallace trees. This assembly method was devised from examining netlists in [1] and was very straightforward; no optimization was attempted in this stage. For arrays, it is possible to incorporate placement information during assembly, since the array structure
implies a regular layout. For Wallace trees, the structure was undetermined until a later placement stage.

3.2.5 Adder Generators

In this stage, we examined three types of adders: the ripple adder, the carry-skip adder and the carry-select adder. These were chosen because they represent a range of delay optimality. It has been established in [37] that there exists a very clear energy-delay trade-off in adder designs, with faster adders consuming more energy per operation. Delay in adders is reduced by incorporating circuitry which performs "lookahead" or prediction of calculations. As a rule of thumb, the more circuitry which is applied to lookahead, the lower the delay. More circuitry also implies more capacitance and therefore more energy dissipation. Our experiments essentially confirmed this view.

Adder generators are also fairly straightforward. Although there are sizing issues involved in some of the adder models, we only implemented simple minimum-sized circuits, again to determine characteristics of minimum power configurations.

3.2.6 Layout Model and Wallace Tree Placement

To evaluate complete multipliers, we require a common layout model for both the regular array and the Wallace tree. It is important here to avoid penalizing one style artificially at the expense of the other. We used a simple unit grid model, in which each component cell of the multiplier netlist occupies one slot in a set of standard cell-style rows (see Fig. 20). We
made all logic blocks unit sized, to facilitate placement. We assume over-the-row routing of
vertical wires connecting cells in different rows, but we conservatively estimate the impact
on total height and wire length of the wires that must make horizontal jogs in each wiring
channel. A post-process global router embeds wires into the placed model, and the maxi-
mum density of each channel is used to derive channel height and final wirelength esti-
mates.
Figure 20. Multiplier layout model, showing input pins, output pins, component rows, wiring channels, and global routes.

For the array, we procedurally tile the regular partial product reduction cells into the grid, with the final reduction adder carefully located at the bottom of the array. Since nearly
all the connections in the array are nearest neighbor between cells in the partial product reduc-
tion network, the array fits well into this model. On the other hand, the Wallace tree requires
constructive placement and routing of its partial product generation (AND gates), reduction
network ((3,2) carry save adder tree) and final reduction adder. We use a simple annealing-
based placement strategy [47] that strives to minimize the overall wirelength while densely
packing the grid.
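An annealing-based placement of this flavor can be sketched as pairwise cell swaps on the unit grid, accepted by a Metropolis rule, with half-perimeter wirelength as the cost. The netlist, cooling schedule, and function names below are illustrative assumptions; the actual placer of [47] is more elaborate.

```python
# Sketch of annealing-based unit-grid placement minimizing half-perimeter
# wirelength (HPWL).  The 16-cell example netlist is a made-up stand-in
# for a Wallace tree netlist.
import math
import random

def hpwl(nets, pos):
    """Total half-perimeter wirelength over all nets."""
    total = 0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(cells, nets, rows, cols, iters=20000, t0=5.0, seed=1):
    rng = random.Random(seed)
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    pos = dict(zip(cells, slots))            # initial arbitrary placement
    cost = hpwl(nets, pos)
    best = cost
    for i in range(iters):
        t = t0 * (1.0 - i / iters) + 1e-6    # simple linear cooling
        a, b = rng.sample(cells, 2)
        pos[a], pos[b] = pos[b], pos[a]      # propose a pairwise swap
        new = hpwl(nets, pos)
        if new <= cost or rng.random() < math.exp((cost - new) / t):
            cost = new
            best = min(best, cost)
        else:
            pos[a], pos[b] = pos[b], pos[a]  # reject: undo the swap
    return pos, best

# Small example: a 15-edge chain plus one 4-pin net, on a 4x4 grid.
cells = [f"g{i}" for i in range(16)]
nets = [(f"g{i}", f"g{i+1}") for i in range(15)] + [("g0", "g5", "g10", "g15")]
pos, best = anneal(cells, nets, rows=4, cols=4)
```

The same HPWL extraction that drives the cost function can afterwards supply the per-net wire loads used in the power and delay estimates.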
Finally, given a complete netlist and a real, although highly simplified placement, we can
extract back per-net capacitance and determine estimates of both power and delay for each
complete multiplier.
3.2.7 Power and Delay Estimation

Initially, we developed Verilog files for the various multiplier architectures, where power
was computed by counting the number of transitions which occurred at each cell output, as
described above. For our more refined design methodology, we developed a simplified, cus-
tom logic simulator for these designs. The simulator is an event-driven evaluation engine,
where drivers send signals to receivers. If the value of a receiver is unresolved when a new
driving signal arrives, this corresponds to a potential glitch, and the previous driving signal
may be preempted. Since the filtering of small glitches from the event queue can cause power
estimation to be inaccurate (i.e., too low), power consumed during glitches is also estimated.
The accuracy of this method was variable, and the inaccuracy grew with the multiplier
size. Although comparisons between our custom logic simulator, operating on characterized
cell-level netlists with back annotated wire delays, against transient device-level HSPICE
simulation on small (4 bit) multipliers showed results within 5% of HSPICE estimates, we
observed that estimates for larger multipliers were more inaccurate. Our later work concen-
trated on improving the accuracy of our estimates.
In practice, what interests us is not the absolute accuracy as much as the relative accu-
racy of estimates between designs. For example, we seemed to consistently overestimate the
energy dissipated in our designs; this overestimation grew as the multiplier size grew. How-
ever, this overestimation was fairly consistent across all types of multipliers at a given size.
The relative difference in power between two multiplier designs at the same size was in most
cases very close to the real relative difference, as calculated by HSPICE. This will be
described in more detail in later chapters.
3.3 Experimental Results
We explored 18 different multiplier implementations in all: two different architectures
(array versus Wallace tree), three different final reduction adders for each multiplier (carry
select, carry skip, carry ripple), and three different word widths (8, 16 and 24 bits).
3.3.1 Layout Characteristics

Table 3 and Table 4 show the area estimates for each multiplier. Recall that wiring impacts
the overall area because of the density estimates for each wiring channel. Unsurprisingly, the
array multipliers are always smaller due to their much more local wiring. Also, the ripple-
adder versions are always the smallest, again due to less adder hardware and more local wir-
ing in these adders. As a point of comparison here, [32] describes a 6 bit array multiplier as
part of an 8-tap FIR filter that occupies roughly 0.4 mm2 in a 0.8µm CMOS process; this sug-
gests that our area estimates are reasonable.
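As a rough cross-check of this comparison, the 6 bit reference can be scaled to 8 bits assuming area grows with the square of the word width. The quadratic law is our assumption (the partial product count is quadratic in the width), not a claim from [32].

```python
# Back-of-the-envelope scaling of the 6 bit array multiplier of [32]
# (roughly 0.4 mm^2 in a 0.8um process) to 8 bits, assuming area scales
# with the square of the word width.  Assumption is ours, for a sanity
# check only.
ref_bits, ref_area = 6, 0.4               # bits, mm^2, from [32]
scaled = ref_area * (8 / ref_bits) ** 2   # ~0.71 mm^2 at 8 bits
```

The scaled figure of about 0.71 mm^2 is the same order as the 0.53-0.67 mm^2 range of the 8 bit array estimates in Table 3, supporting the claim that the estimates are reasonable.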
3.3.2 Energy Per Operation

Fig. 21 shows estimated average energy per multiply operation for each of the multipliers.
Results suggest a fairly consistent 10% energy advantage for the Wallace trees, across the
Table 3. Array multipliers - estimated area (mm2)
Adder Type 8 bit 16 bit 24 bit
Carry Sel. 0.668 2.195 4.579
Carry Skip 0.627 2.099 4.430
Ripple 0.532 1.913 4.156
Table 4. Wallace tree multipliers - estimated area (mm2)
Adder type 8 bit 16 bit 24 bit
Carry Sel. 0.759 2.626 5.625
Carry Skip 0.736 2.537 5.576
Ripple 0.725 2.488 5.576
three bit widths examined. Despite the larger amount of irregular wiring, the shallower par-
tial product reduction in the Wallace tree appears to be advantageous for power.
Figure 21. Estimated average energy per multiply op. (Energy per operation in picojoules versus multiplier size in bits, for array and Wallace tree multipliers with carry select, carry skip, and ripple adders.)

3.3.3 Delay

Fig. 22 shows estimated delay for each multiplier. In general, the delays are fairly long since we use a 0.8µm CMOS process, and because all devices are of minimum size. For the small 8 bit design, results of semi-exhaustive simulation over all pairs of inputs appear in Table 5, along with more conservative static timing estimates for comparison. That is, we simulated all possible inputs (n inputs, 2^n simulations), as simulating all possible input transitions was deemed to be impractical (2^(2n) simulations).
The static timing estimates are pessimistic by 30-50%. The greatest discrepancy occurs
with the carry-skip adder, which has many false paths, due to carry propagate prediction cir-
cuitry. Note that the carry skip adder does not show a speed advantage at low bit width, due to
the lookahead circuitry. Without this circuitry, the carry skip adder behaves like a ripple adder.
Also note that the use of a ripple adder completely negates the advantage of using a Wallace
tree, as expected.
Fig. 22 offers both a pessimistic (static timing) and an optimistic (worst case encountered
during simulation of random patterns) timing estimate for each multiplier. The wide overlap
of the array and Wallace tree timing intervals certainly suggests that the Wallace trees are at
least competitive in delay. Indeed, the intervals for each Wallace tree cover smaller delays
than the corresponding array interval, which also suggests that the trees are faster, again as
expected.
Table 5. 8 bit multiplier - estimated delay (ns)

              Array                       Wallace tree
              exhaustive    static       exhaustive    static
              simulation    timing       simulation    timing
Carry sel.    16.12         22.73        14.86         18.90
Carry skip    18.93         28.93        18.42         27.25
Ripple        18.79         27.94        18.81         26.47
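The pessimism of the static timing estimates can be checked by direct arithmetic on Table 5:

```python
# Static-timing pessimism computed from Table 5 (all values in ns).
table5 = {
    # adder: (array_sim, array_static, wallace_sim, wallace_static)
    "carry_sel":  (16.12, 22.73, 14.86, 18.90),
    "carry_skip": (18.93, 28.93, 18.42, 27.25),
    "ripple":     (18.79, 27.94, 18.81, 26.47),
}

# Percent by which the static estimate exceeds the simulated delay.
pessimism = {}
for adder, (a_sim, a_st, w_sim, w_st) in table5.items():
    pessimism[adder] = (100 * (a_st / a_sim - 1), 100 * (w_st / w_sim - 1))
```

The ratios span roughly 27% to 53%, consistent with the quoted 30-50% range up to rounding, and the carry-skip adder (with its many false paths) indeed shows the largest discrepancy.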
Taken together, Fig. 21 and Fig. 22 suggest that the Wallace trees have an energy-delay advantage over the regular array multipliers. Wallace trees are not so bad as suggested by [38], nor as significantly superior as estimated by [37].
Figure 22. Estimated worst-case multiplier delay (ns). (Delay in nanoseconds for array and Wallace tree multipliers at 8, 16 and 24 bits, showing both simulation estimates and static timing estimates for the carry select, carry skip and ripple adders.)
3.3.4 Further Modelling Refinements

For the purpose of architectural comparison, we developed a simple modelling methodology which allowed rough comparisons of design styles. Our estimates are based on a custom logic simulator, which was necessary due to the large amount of time required to run HSPICE. The layout details were coarse so as to be able to generate and compare large sets of designs. Our goal for subsequent experiments (in Chapters 4 and 5) was to refine the methodology, and to improve the accuracy of simulation results and layout detail.

Simulation values use HSPICE as a 'golden standard'. A main limitation of our custom simulator was that it determined delay and power values using interpolation based only on two data points (delay with output load 0 fF and 100 fF.) We needed to upgrade our simulation environment to be able to interpolate among an arbitrary number of data points (for example, 6 loads, ranging from 0 fF to 60 fF.) Furthermore, we did not initially account for input-slope effects on delay and power. Again, modifications were planned, to allow for interpolation from an arbitrary number of points (we determined that the majority of input slopes range from 100 ps to 2 ns).

A second source of inaccuracy arose in our simple layout model, which did not account for unequal cell sizes. This tended to throw off area estimates, as well as interconnect distance estimates. The error is mitigated by the fact that array and Wallace tree implementations, at a
given size with the same type of final adder, have nearly exactly the same count of full- and half-adders, as well as additional circuitry. Nevertheless, it is quite possible that internal wiring lengths, which are based on the size of these blocks, will be off as a result. In later experiments, we determined that we should estimate cell sizes based on transistor counts, and use these estimates during layout.

Finally, our use of global wiring data to determine wire capacitance lacked detail. A desirable refinement would be to incorporate detailed wiring capacitance numbers. This may well be a requirement in very deep submicron technologies, where wire capacitance may be a dominant factor. Although it is difficult to understand this effect without implementing a full routing-and-extraction methodology, we should verify this result. For our final experiments, we implemented a few designs using the Cadence tool flow, allowing us to extract and simulate complete circuits—these are discussed in the next two chapters.

3.4 Summary

By introducing a simple unit grid layout model, we have been able to compare regular array and Wallace tree style unsigned multipliers over bit widths 8 to 24 bits, including first-order delay and area effects due to physical wiring. The model is clearly coarse, but capable of making basic predictions for area, for average power, and delay. Interestingly, the Wallace trees fare rather well, despite their irregularity and excess wiring. The smaller depth of
their partial product reduction hardware seems to offset the power lost in the wiring, offering
improved energy and delay. We believe this preliminary result justifies closer investigation
with more refined models of the problem, e.g., use of more aggressive exact critical path anal-
ysis[48], and better layout optimization, to determine more accurately the advantages of the
Wallace tree style.
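The interpolation upgrade planned in Section 3.3.4 amounts to piecewise-linear lookup along an arbitrary set of loads, extended bilinearly once input slope becomes a second characterization axis. In this sketch the six loads and the slope range follow the text, while the delay surface and all function names are illustrative placeholders:

```python
# Multi-point characterization lookup: piecewise-linear in output load,
# bilinear once input slope is added as a second axis.  The delay
# surface below is a made-up placeholder, not measured data.
import bisect

def interp1(xs, ys, x):
    """Piecewise-linear interpolation on sorted xs, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x)
    f = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + f * (ys[i] - ys[i - 1])

def interp2(loads, slopes, table, load, slope):
    """Bilinear lookup: interpolate along the load axis at each
    characterized slope, then along the slope axis.
    table[j][i] holds the value at (slopes[j], loads[i])."""
    col = [interp1(loads, row, load) for row in table]
    return interp1(slopes, col, slope)

loads = [0.0, 12.0, 24.0, 36.0, 48.0, 60.0]   # fF: six points, 0-60 fF
slopes = [0.1, 0.5, 1.0, 2.0]                 # ns: 100 ps to 2 ns
# Placeholder delay surface (ns), growing with load and input slope:
table = [[0.5 + 0.02 * l + 0.3 * s for l in loads] for s in slopes]

d = interp2(loads, slopes, table, load=30.0, slope=0.75)
```

Because the placeholder surface is itself linear in both axes, the bilinear lookup recovers it exactly; for real characterized data the scheme gives a piecewise-linear approximation between grid points.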
4 Minimizing Switching Activity By Latch Insertion
4.1 Introduction
The large amount of power dissipation due to false switching in multiplier designs leads
to the question of to what extent one can remove or minimize this false switching by altering
the logic structure. The focus of glitch elimination should be the partial product reduction
tree, as this logic comprises over 50% of the total gate count.
In this chapter, we consider the use of latches inserted into the logic structure to delay
early transitioning signal lines, and attempt to determine their effect on different multiplier
schemes. Our initial interest in the idea of latch insertion was motivated by prior work
which applied this technique to array multipliers[43][45]. We wished to determine whether
latches can be successfully used in Wallace trees. We developed a latch insertion methodology which targeted portions of the multiplier with high switching activity, while attempting to
minimize the power dissipated by the extra latches. Attempting to balance the power saved
versus the extra power dissipated is very difficult. Our results showed that although switching
is reduced through the use of latches, various characteristics of the Wallace tree structure ren-
der the gains minimal when the overhead of the latch reduction circuitry is included.
4.2 False Switching in Multipliers
The root cause of false switching in CMOS logic is input signals arriving at different times, causing the output of a logic gate to transition several times before settling to its final value. We are interested in glitching caused by the following delay effects.
a) input paths of unequal logic depth - in this case, one signal path has to traverse more gates than another signal path before arriving at the input. If all gates have the same delay, the signal on the deeper path will arrive later (Fig. 23a).
b) input paths with high delay gates - if a path consists of gates which have a large delay,
signals traversing this path will be delayed, even if they have the same logic depth as other
paths (Fig. 23b).
c) delay caused by input dependent delay - for example, if a conventional two-input NOR has its inputs switch from 00 to 01, the fall time will be longer than if the inputs had switched from 00 to 11.

d) glitching which is propagated from upstream - if the input to a gate switches several times, the output may also switch several times. This is particularly true if there is significant delay between switch events. Note however, that if the transitions occur fairly close together, or the input is a non-controlling input, the output may not toggle (in effect, a gate "filters out" a glitch.)

4.2.1 Input Latching

In Chapter 3, we described how array multipliers encounter a greater amount of false switching than do Wallace tree multipliers. Whereas all the inputs to adders arrive in parallel, the additions themselves occur in a sequential manner, requiring multiple calculations by the same adder block.
Figure 23. Forms of glitch-inducing delay. (a) Input paths of unequal logic depth. (b) Input paths containing high-delay and low-delay gates.
One approach to alleviating this problem is to make the inputs available only when they
are needed; that is, to delay inputs which arrive earliest, until the latest available input is
present. In this manner, the logic blocks see all inputs change at approximately the same time.
In the array multiplier, it is evident that a great number of partial product bits should be
delayed, since it is not necessary that they all arrive in parallel. Specifically, the PP bits which
feed into later adders should be more delayed. This is the approach suggested by Lemmonds
and Shetti[43] and is depicted in Fig. 24. The delaying of the input bits is performed using a
Figure 24. Using latches to re-time the generation of partial products (Booth). Shown: multiplier inputs, Booth encoders, Booth decoders, and latches, each triggered at different times.
series of latches, which are each timed to delay early arriving input bits, so that their arrival times are synchronized with later arriving inputs. This can be applied to basic PP bit generation, as well as Booth encoded multipliers.

The results from [43] indicated that power in a 16 bit Booth recoded multiplier could be reduced by up to 40%. Note that there are paths that go from the input of the multiplier to the output, but do not encounter a latch. Since we are slowing down the excessively fast paths in this scheme, we should theoretically be able to generate a multiplier which does not have a delay greater than the non-latched version.

This type of latching addresses false switching caused by a) unequal logic depth, but does not deal with b) high delay gates, c) input dependent delay or d) propagated glitches. Insofar as glitches generated by a) are reduced, some effect is felt in reduced d) propagated glitches.

4.2.2 Latching the Signal Path

The previous work equalized the delays at all the primary inputs, then performed the calculation. An alternative which reduces glitching more generally is to insert latches in the signal path at deeper logic levels, to ensure that all sources of unequal input delay are "equalized." In this method, blocks whose delays are equalized have latches present on all inputs. When a new multiplication is started, the latches are initially closed. For each logic
block, the delay of the latest arriving signal is calculated, and each latch is then triggered with a signal that arrives when the latest arriving input is steady. This is described in Fig. 25.

Figure 25. Using latches to equalize signal arrival times in the signal path (Transition Retaining Barriers.)

Several implementation details are relevant. A "row" of latches is inserted at a given depth all across the signal path—not only do latches equalize delay at a given logic block, but outputs from adjacent logic blocks are also synchronized to the given logic block. This means that an equalization effect is seen at later logic blocks.

The latch triggering signal can be implemented by a chain of inverters, whose length is determined by the required delay of the triggering signal. In the case where latches use different triggering signals, this chain of inverters can be used to generate several timing signals. The signal propagates down the inverter chain and is incrementally delayed, so that a latch which is to be triggered at a particular time simply connects to the delay line at the appropriate location. In this manner, power is conserved by using the same timing signal for all latches. In practice, for reasons which will be discussed later, we prefer to trigger latches in parallel. This is feasible for array multipliers, whose regular layout suggests using a bank of latches which are triggered off of the same signal. To achieve this, the triggering signal generated by the chain of inverters is buffered to drive all the latches in parallel.

4.2.3 Previous Work

The above approach was investigated by Musoll and Cortadella in [45]. The latching structure is incorporated into the full-adder circuitry by putting 'enable' transistors into the pull-up and pull-down paths. This type of logic is called Clocked CMOS or C2MOS, of which a description can be found in [1] (see Fig. 27). Using this technique avoids adding an explicit latch (an extra logic stage) but does introduce delay if the devices are not appropriately upsized. The authors define the use of this structure as implementing a transition-retaining barrier (TRB). The methodology for determining the value of this technique is straightforward. Since array multipliers consist of rows of adders which operate at roughly the same time, the objective is to find the row of adders in which to place the TRBs. It is also possible to insert more than one row of TRBs.

The effect of TRBs for array multipliers is shown in Fig. 26. Initially, false switching is a linear function of logic depth. Where TRBs are inserted, false switching events are elimi-
nated for devices at that point in the PP reduction stage. Furthermore, false switching events
which would have been propagated to later stages are also removed. The location of the row
of TRBs is important, since locating them too early or too late in the PP reduction stage
reduces the amount of gain which is achieved.
The conclusion of this research was that it is possible to reduce spurious transitions in a
32-bit multiplier by up to 30%, while incurring an 8% delay penalty. Generally, more TRBs
Figure 26. The false switching in an array is linear in terms of the logic depth (plots of number of transitions versus logic depth). a) For a 16 bit multiplier, the false switching at logic depth 16 is approximately 8 toggles/operation. b) Inserting a TRB saves some of the switching (gray box.) c) The TRB should not be inserted too early or d) too late, as this lessens the amount of false switching that is eliminated.
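The placement argument of Fig. 26 can be reproduced with a toy linear-accumulation model. The growth constant is chosen so that a depth-16 design accumulates about 8 spurious toggles per operation at the deepest level, as in the figure; everything else is an illustrative assumption, not the analysis of [45].

```python
# Toy model of Fig. 26: spurious transitions grow linearly with logic
# depth, and a TRB row resets the accumulation.  C is chosen so depth 16
# shows ~8 toggles/operation; it is illustrative only.
C = 0.5   # spurious toggles accumulated per level of logic depth

def spurious_total(n, trb_row=None):
    """Total spurious transitions over logic depths 1..n, with an
    optional TRB row at depth trb_row resetting glitch accumulation."""
    total = 0.0
    for d in range(1, n + 1):
        depth_since_reset = d if trb_row is None or d <= trb_row else d - trb_row
        total += C * depth_since_reset
    return total

n = 16
base = spurious_total(n)                                     # no TRB
best_row = min(range(1, n), key=lambda r: spurious_total(n, r))
saving = 1 - spurious_total(n, best_row) / base
```

In this symmetric model the single TRB row lands at mid-depth (row 8 of 16), echoing the figure's point that inserting the barrier too early or too late lessens the gain; the predicted saving is optimistic because the model ignores the latch overhead discussed below.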
90 Analysis and Design of Low Power Multipliers
False Switching in Multipliers
rrent
lution
tu-
yield a greater reduction in energy dissipation but also cause greater delay. Furthermore,
larger multipliers have more false switching and therefore more energy reduction can be
achieved with respect to the base case.
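The linear-in-depth behavior sketched in Fig. 26 can be captured by a small illustrative model. This is a sketch only: the per-level toggle rate and the assumption that a TRB row fully resets accumulated glitching are simplifications introduced here, not measured data from this work.

```python
# Toy model of Fig. 26: spurious transitions at a logic level grow with the
# distance to the nearest upstream barrier (the primary inputs or a TRB row).
# The toggle rate per level is an assumed illustrative constant.

def spurious_transitions(depth, trb_rows=(), toggles_per_level=0.5):
    """Total spurious toggles/operation summed over all logic levels."""
    total = 0.0
    last_barrier = 0
    for level in range(1, depth + 1):
        if level in trb_rows:
            last_barrier = level      # TRB resets accumulated glitching
        total += toggles_per_level * (level - last_barrier)
    return total

base = spurious_transitions(16)                  # no TRBs
mid = spurious_transitions(16, trb_rows={8})     # one TRB row mid-array
early = spurious_transitions(16, trb_rows={2})   # inserted too early
assert mid < early < base
```

Under this model a mid-array barrier removes more total glitching than one placed too early or too late, mirroring panels (c) and (d) of the figure.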
4.2.4 General Principles of TRB Insertion
Basic circuitry
There are several methods of incorporating latching behavior in CMOS, e.g., latches,
flip-flops, pass transistors. Our goal is to use latches which are lowest in power, while mini-
mizing delay effects. The use of C2MOS logic has the advantage that no additional logic
stages are introduced. Unfortunately, incorporation of another transistor in the pullup and
pulldown stacks of transistors increases delay due to increased pullup/down resistance. This
can be mitigated by increasing the width of all transistors in the pullup/down paths.
Another problem with latching occurs when the latches are closed for extended periods of time. When the latches are turned off, the logic at the output of the latches is not being driven (there is no conducting path from the gate input to the rails). This condition is known as a “floating” gate, and if the voltage on this gate drifts, this can cause short-circuit current to flow in the driven gate. Note that the latches in [45] do not address this issue. The solution is a pair of inverters in an SRAM configuration, one of which is weak (Fig. 27c). Unfortunately, this increases the power dissipation of the latch.
We investigated alternative designs to incorporate latching behavior. One method was using a simple pass transistor with the back-to-back inverters for holding the state. This has the advantage that we can eliminate a great deal of gate capacitance (capacitance on the input of the latch). Unfortunately, pass transistors are good for passing either a high signal or a low signal, but not both; this causes unequal rising and falling delays. Another technique was to use a pass gate (Fig. 27d); this alleviates the problem of unequal rise/fall time, but uses more power. However, the use of a pass gate is superior to C2MOS in that it has slightly less on-resistance for the same size gates (since two transistors in parallel are in the conducting path). We used the pass gate with back-to-back inverters in our experiments.
Figure 27. Incorporating latching behavior into an inverter. (a) Inverter. (b) C2MOS version of inverter. (c) Incorporating state preservation (one feedback inverter is weaker). (d) Using a pass gate.
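The on-resistance argument can be put in first-order numbers. The per-device resistances below are illustrative assumptions, not values characterized in this work:

```python
# First-order on-resistance comparison of the latching styles in Fig. 27,
# using illustrative per-device resistances (assumed, not measured).
R_N = 10e3   # on-resistance of a reference NMOS device, ohms (assumed)
R_P = 20e3   # on-resistance of the matching PMOS device, ohms (assumed)

# C2MOS: the clocked device sits in series with the logic device, so the
# pull-down path resistance doubles unless the stack is upsized.
r_c2mos_pulldown = R_N + R_N

# Upsizing every transistor in the stack by 2x restores the original path
# resistance (at the cost of extra gate capacitance).
r_c2mos_upsized = (R_N / 2) + (R_N / 2)

# Transmission (pass) gate: NMOS and PMOS conduct in parallel, so for the
# same device sizes the on-resistance is lower than a single device.
r_passgate = (R_N * R_P) / (R_N + R_P)

assert r_c2mos_upsized == R_N
assert r_passgate < R_N < r_c2mos_pulldown
```

The parallel combination is why the pass gate shows slightly lower on-resistance than the C2MOS stack at equal device sizes.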
Latch timing signal
The latches are triggered by a signal which is supplied at the appropriate time. As men-
tioned above, this signal should arrive when the latest arriving input to the adder blocks
transitions. To generate this signal, the triggering time is determined by static delay analysis,
and a chain of delay elements is constructed to replicate this delay time. These delay ele-
ments should be as power efficient as possible.
A simple method of providing a delay is to use back-to-back inverters. A more power-efficient alternative is the current-starved inverter. These have several problems: in the case where we wish to tap the delay line at various points, there is not a fine resolution in delay times to choose from. Secondly, there may be manufacturing variations which cause the delay time to vary, and this can be exacerbated in current-starved inverters. If the latches are triggered too early, the latches will provide less power reduction than desired; if the latches are triggered too late, the multiplier will experience greater delay than expected.
The fact that the delay elements consume power implies that there will be a bias towards placing the latches shallower in the tree (at a lower logic depth), since placing them at a deeper logic depth involves using more delay elements.
Note that having a delay element (inverter) drive several latches can cause the delay ele-
ment to see a very high load capacitance. This will cause the delay of the inverter to be large,
since the rise/falltime of the output is now very long. This effect can be overcome by upsizing
the delay blocks, or for very large loads, having series ratioed inverters [2] or buffers drive the
latches (see Fig. 28). Upsizing inverters adds more gate capacitance, which will increase power. However, strongly driven signals have very sharp rise/fall times, and therefore the short-circuit component of power dissipation is minimized, as described in Chapter 2, Fig. 11.
Figure 28. Triggering of latches. A chain of inverters may be used to generate the delay signal. If all latches are driven in parallel (a), the final signals should be buffered. Otherwise (b), the delay chain can be used unbuffered, assuming the load is not too great.
4.3 Latches as Transition Retaining Barriers in Wallace trees
The analysis by Musoll and Cortadella [45] showed that TRB techniques can be successfully used to reduce power dissipation in array multipliers. However, the more interesting question is whether something similar can be applied to Wallace trees. There are several issues which warrant such an investigation.
Although Wallace trees are a fairly well-established idea, they have recently become more widely used in industry. The main reason is the much improved delay properties of this type of partial product reduction stage, which make it more attractive for high-speed design. Just as important has been the realization that although the irregularity of the layout and wiring requires a much greater design effort than array schemes, the increased availability of optimizing CAD tools has facilitated the design process. Finally, the recognition of the importance of the multiplier (and the consequent willingness to devote much more time to this block) has led to Wallace tree designs becoming more prevalent.
We decided to determine whether one could apply TRBs to Wallace trees. We targeted
Wallace trees with carry-select final adders, as these were the fastest but highest-power mul-
tiplier designs that we investigated (as shown in Chapter 3).
4.3.1 Placement of Latches
In the array designs, the placement of latch elements is fairly straightforward, due to the regular structure of the carry-save adders. Recall that an array structure is laid out in levels,
with a very regular, repeated structure. To a first order, signals travel in a wave down the array
(actually, a series of waves, due to unequal logic depths from the inputs), with adjacent circuit
elements switching at approximately the same point in time. When placing a row of latches in
the array structure, one can trigger all latches in a row at the same time without major effect
on the delay.
The row of latches can be moved up or down in the array, until the optimum point is reached. Furthermore, several rows of latches can be incorporated into the same array. The ideal number of rows of latches can be determined experimentally. The optimum occurs when the overall power is no longer reduced (the overhead of putting more rows of latches starts overwhelming the power reduction achieved through glitch reduction).
Figure 29. Width of signal path: (a) In arrays, the width of the signal path is constant at all logic depths, but in Wallace trees (b), the width of the signal path is greater at shallower logic depths, and the width decreases as the logic depth increases.
4.3.2 Wallace Tree Latch Placement
The placement of latches in a Wallace tree is more complicated than in arrays. This is due to the varying “width” of the signal path at a given logic depth. Recall that arrays have a regular structure consisting of a row of carry-save adders which is repeated n-1 times for an n-bit-width multiplier. For the partial product reduction stage, at a given logic depth, the number of gates in a row of carry-save adders is always n. This means that we can block all signal paths in the PP reduction stage using (n * # of inputs) latches. We can describe arrays as having a “width” of O(n) in the PP reduction stage (Fig. 29a).
Wallace trees, on the other hand, have a width which varies by logic depth. At the first level, the number of signal paths is O(n * log3/2 n). The width of the signal path decreases until we have two bit vectors to add; the final width is approximately 2n - log3/2 n (Fig. 29b).
We can clearly see that there is a bias towards placing a row of latches deeper in the tree, since at each level, the number of blocks in a row decreases by log3/2 n. (This is in contrast to the bias towards having latches at a shallower logic depth, to minimize the number of elements in the inverter chain.)
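The shrinking width can be made concrete with the standard 3:2 row-reduction recurrence for Wallace trees (a sketch introduced here for illustration, not code from this work): each level compresses groups of three partial-product rows into two using carry-save adders.

```python
# Rows of partial products remaining after each Wallace reduction level:
# groups of three rows are compressed to two by 3:2 counters (carry-save
# adders); leftover rows pass through. Iterate until two rows remain.

def wallace_levels(rows):
    """Return the row count at each level, from the inputs down to 2."""
    widths = [rows]
    while rows > 2:
        full_adders, leftover = divmod(rows, 3)
        rows = 2 * full_adders + leftover
        widths.append(rows)
    return widths

print(wallace_levels(16))   # [16, 11, 8, 6, 4, 3, 2]
```

For 16 rows, six reduction levels are needed, consistent with the log3/2 n depth cited above, and the width visibly shrinks toward the final adder.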
4.3.3 Latch Insertion Methodology
Our initial design flow for latch insertion relied upon hand-tuning to optimize results. The
placement of the latches was determined procedurally. Refinement of latch placement was
performed by hand. The connection of signals from the inverter delay chain to generate timing
of the latches was also arranged by hand. The procedure is as follows (Fig. 30):
• Create a chain of inverters.
• Determine placement of latches.
• Set timing of latches.
• Trim inverter chain.
Figure 30. Procedure for latch insertion: 1) create a chain of inverters, 2) determine placement of latches, 3) set timing of latches, 4) trim inverter chain.
Create a chain of inverters
Initially, we need to create a chain of inverters to provide a series of trigger points (points in time) for latches. First, we perform a static timing analysis of the multiplier. This gives us a maximum delay through the multiplier. We then create a primary input which will provide the initial timing signal for the latches. To this input, we add a delay block (two back-to-back inverters), and calculate its delay. If this delay is less than the delay of the multiplier, we add a new delay block to the output of this delay block. We repeat this process until a chain of inverters has been created whose overall delay is equal to or greater than the delay of the multiplier; this means we are guaranteed to be able to trigger latches which are placed anywhere in the multiplier.
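The construction loop above can be sketched as follows. The per-block delay value is an illustrative assumption; in the flow it comes from circuit-level analysis of the inverter pair.

```python
# Sketch of the delay-chain construction step: append delay blocks (pairs
# of back-to-back inverters) until the chain's cumulative delay covers the
# multiplier's maximum delay from static timing analysis.

def build_delay_chain(multiplier_delay_ns, block_delay_ns=0.4):
    """Return the tap times (ns) of a chain long enough for any latch."""
    taps = []
    total = 0.0
    while total < multiplier_delay_ns:
        total += block_delay_ns          # add one inverter pair
        taps.append(total)
    return taps                          # last tap >= multiplier delay

taps = build_delay_chain(5.0)
assert taps[-1] >= 5.0
```

Every intermediate tap is available as a candidate trigger point, which is what makes the later timing-assignment step possible.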
Determine placement of latches
Good locations for latch placement are determined, and latches are inserted. Good locations are points which yield a maximum of power reduction using the fewest latches. These points are determined empirically through hand experiments. Two locations initially suggest themselves for latch insertion: at the Wallace tree/final adder boundary, and inside the Wallace tree (Fig. 31).
The Wallace tree/final adder boundary is a good candidate for two reasons: 1) a row of
TRBs can be placed here using a small number of latches because the "width" of the multi-
plier is smaller and 2) the adder is at a deep logic depth, and is experiencing a great amount of
false switching. A disadvantage is that since this is at a deep logic depth, a long inverter chain
is needed to generate timing for the latches.
The second location, inside the Wallace tree, is potentially rewarding because the effects
of the TRBs can be felt in downstream logic. Therefore, switching activity can be reduced in
the Wallace tree as well as the adder. Since these are at a shallower logic depth, fewer invert-
ers will be needed. However, the "width" of the multiplier at shallower logic depths is much
greater.
These points form the basis of the experiments described in the next section.
Figure 31. Potential latch insertion sites: a) at the Wallace tree/final adder boundary and b) in the Wallace tree.
Set timing of latches
Prior to latch insertion, a static timing analysis has been performed on all inputs to logic
blocks in the multiplier. These times serve as a basis for determining when latches should be
triggered. For example, for inputs A, B, and C, if signals arrive at times A = 5 ns, B = 7 ns, and C = 9 ns, latches should be placed on A and B, and they should be triggered at time 9 ns.
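The rule in this example can be sketched directly (hypothetical signal names, with the arrival times used in the text):

```python
# Sketch of the trigger-time rule: latch every input that arrives before
# the latest one, and trigger those latches when the latest input becomes
# valid. Arrival times come from static timing analysis.

def plan_latches(arrival_ns):
    """Return (inputs to latch, trigger time) for one adder block."""
    trigger = max(arrival_ns.values())
    latched = [sig for sig, t in arrival_ns.items() if t < trigger]
    return latched, trigger

latched, trigger = plan_latches({"A": 5.0, "B": 7.0, "C": 9.0})
assert latched == ["A", "B"] and trigger == 9.0
```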
Timing is determined on an incremental basis: latches are ordered by the time they are to be triggered, earliest to latest. First, the earliest-triggered latches are connected to the appropriate point on the delay line (chain of inverters). As mentioned earlier, connecting latch inputs to an inverter output increases the delay of the inverter. Therefore, after connecting latches to an inverter, the timing of the entire delay line is recalculated. Then the next set of latches is connected to the delay line, and the process iterates until all latches have been connected.
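A sketch of this incremental loop follows, with an assumed linear load/delay model (the base delay and per-latch penalty are illustrative constants; the real flow recomputes delays with circuit-level timing analysis):

```python
# Sketch of incremental timing: latches are processed earliest first; each
# connection loads its inverter stage, so all downstream tap times are
# recomputed before the next latch is placed.

BASE_DELAY = 0.4        # ns per delay block, unloaded (assumed)
DELAY_PER_LATCH = 0.05  # ns of extra stage delay per attached latch (assumed)

def connect_latches(trigger_times, chain_len=40):
    loads = [0] * chain_len            # latches attached to each stage
    assignments = {}
    for latch, t_need in sorted(trigger_times.items(), key=lambda kv: kv[1]):
        # Recompute tap times under current loading, then pick the first
        # tap that is no earlier than the required trigger time.
        tap = 0.0
        for stage in range(chain_len):
            tap += BASE_DELAY + loads[stage] * DELAY_PER_LATCH
            if tap >= t_need:
                loads[stage] += 1
                assignments[latch] = stage
                break
    return assignments, loads

asg, loads = connect_latches({"L0": 1.0, "L1": 1.0, "L2": 3.0})
```

Note how the second latch needing 1.0 ns loads the same stage as the first, which in turn pushes every later tap time outward before the third latch is placed.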
Trim inverter chain
Generally, loading the delay line causes the inverter chain delay to increase. The final inverters’ delay often exceeds the initial multiplier maximum delay, so extra inverters are trimmed from the end of the chain.
4.3.4 Placement of Latches on the Wallace Tree/Final Adder Boundary
Since Wallace tree PP reduction stages have a “width” which is narrowest just before the final adder, we decided to investigate putting the latches between the partial product reduction stage and the adder. This would minimize the number of latches which we need to insert. Quite a bit of false switching occurs in the final adder, and the power is worse for adders which have highly parallel carry-lookahead schemes.
If a row of latches is placed between the PP reduction stage and the adder, the timing of
the latches can be optimized to take into account the carry-in ripple effect of the adder. If the
timing of each latch at a higher bit position is successively delayed, the inputs to each block of
the adder can be made to arrive at the same time (roughly) as the carry-in. (Timing is not
exact, due to input-dependent delays.) Inserting the delay blocks is simply an extension of the
delay blocks necessary to delay the trigger signal. Note that these delay blocks will also con-
sume power. We call this “cascading” the trigger signal.
When driving a final adder using the cascaded delay elements pattern, we cannot drive all latches from a single buffer, since the timing of each latch is different. However, if all the latches are driven by different timing signals (points on the inverter chain), we may not need to buffer the timing signals driving the latches, since one timing signal does not see many loads (as explained in Fig. 28). The question was to determine whether timing signals see heavy loads.
For final adders which are of the “ripple” variety, the minimum delay of each block is determined by the carry. Each full-adder block’s outputs are successively delayed by at least one gate level due to the ripple effect. This means that if the other inputs to the full-adder block are to be latched, they should be triggered by a signal which is successively delayed by one adder block (Fig. 32).
Figure 32. "Cascade" triggering style - in this method, the triggering signals of latches on the inputs of an adder are incrementally delayed, so that the inputs are delayed to compensate for the delay of the carry.
Consider the final adder implemented by a carry-select scheme. The delay properties are shown in Chapter 2, Fig. 9. Recall that this kind of final adder consists of a series of ripple adders operating in parallel. Since the ripple adders operate in parallel, the inputs of several ripple adders often are added at the same time, and therefore latches placed on these inputs should be triggered at the same time. Therefore, the condition exists where the delay signals drive several latches in parallel and see a heavy load.
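The cascade timing itself is a simple arithmetic schedule (a sketch; delays are illustrative values in picoseconds, not characterized block delays):

```python
# Sketch of "cascade" triggering (Fig. 32): the trigger for the latch at
# bit i is delayed by roughly one adder-block delay per bit, so latched
# inputs become valid about when the rippling carry reaches that block.

def cascade_trigger_times(n_bits, base_trigger_ps, block_delay_ps=300):
    return [base_trigger_ps + i * block_delay_ps for i in range(n_bits)]

times = cascade_trigger_times(4, 2000)   # picoseconds
assert times == [2000, 2300, 2600, 2900]
```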
Faced with this condition, we can either 1) use buffers to drive the latch timing signals,
and maintain (roughly) load-invariant timing characteristics of the delay chain, or 2) allow the
timing of the delay chain to be affected by the loads, and insert/remove some delay elements
to get the proper timing. We believed that 2) was more straightforward to implement and was
also lower in power, so we chose this method.
The potential for power savings comes from the false switching present in the final adder.
There is a great deal of false switching due to the successive delay of the carry signal through
the adder. The other inputs to the adder arrive at more uniform delay times, so at high bit positions, the inputs to the adder blocks arrive much earlier than the carry. Furthermore, in very fast adders such as the carry select, there are a large number of blocks in the adder, and the total amount of capacitance switched by skewed inputs is greater than in simple ripple adders.
Results for this analysis are described in section 4.4.
4.3.5 Placing Latches in Wallace Trees
Just like in array schemes, the greatest amount of false switching occurs in the later stages
of the Wallace tree. In [45], the authors found that the power reduction as a function of latch placement (in terms of logic depth) was a concave function with a minimum halfway down the array (see Fig. 26). These results suggest that we should try to move the latches up from the adder inputs, into the PP reduction stage itself.
Latch placement for transition reduction is based on levels. Elements are ordered by
their minimum and maximum logic depths. One then decides to place latches at "2 elements
down from the inputs" or "3 elements up from the outputs." For example, placing latches "0
elements up from the outputs" means placing latches at the outputs (see Fig. 33).
We developed a simple procedure which calculates the minimum and maximum logic
depths of each component. We empirically tried placement of latches at a logic depth "up
from the bottom" by various levels. We then attempted to generate the timing signals for
these latches.
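The min/max logic-depth bookkeeping can be sketched as a recursive pass over a netlist graph. The tiny netlist below is hypothetical, constructed only to reproduce the situation of Fig. 33b:

```python
# Sketch of the logic-depth calculation used for latch placement: for every
# component, record the minimum and maximum number of logic levels from the
# primary inputs (primary inputs themselves count as depth 1, as in Fig. 33).

def logic_depths(fanins):
    """fanins: gate -> list of driving gates ([] for primary inputs)."""
    depth = {}
    def visit(g):
        if g not in depth:
            if not fanins[g]:
                depth[g] = (1, 1)
            else:
                lo = min(visit(d)[0] for d in fanins[g]) + 1
                hi = max(visit(d)[1] for d in fanins[g]) + 1
                depth[g] = (lo, hi)
        return depth[g]
    for g in fanins:
        visit(g)
    return depth

# A gate reading both a primary input and a deeper gate (a signal that
# 'skips' a level) gets unequal min/max depth, as in Fig. 33b.
net = {"g1": [], "g2": ["g1"], "g3": ["g1", "g2"]}
assert logic_depths(net)["g3"] == (2, 3)
```

Gates with unequal min/max depth are exactly the cases that make "levels" ill-defined in Wallace trees, as discussed next.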
Figure 33. Placing latches by logic depth. a), b) calculating logic depth, c) placing latches "one up from the bottom."
Placement of latches
A difficulty which we encounter when placing latches in the PP reduction stage is that the notion of “levels of logic” which is present in array schemes is ill-defined for Wallace trees. Arrays have the property that for every component in the array, "minimum logic depth" = "maximum logic depth", as in Fig. 33a. Wallace trees have cases like Fig. 33b, where the minimum and maximum logic depth can be quite different. For example, a section of a 16-bit Wallace tree multiplier is shown in Fig. 34. If we attempt to place latches at one level of logic "up" from the final adder, latches are placed as shown in Fig. 34b.
Figure 34. Placement of latches in Wallace tree. a) Original structure, and b) placement of latches, "one level" up from the PP reduction/adder interface.
We can note two effects from placing the latches at this position. First, the number of latches increases tremendously, as expected, due to the greater width of the tree at low logic depth and the lesser width of the tree at higher logic depth, near the final adder.
Most problematic is the lack of clearly defined “levels”. Since we have some signals which ‘skip’ logic, as described in Chapter 3, Fig. 17, we encounter several paths where there is more than a single latch blocking the signal. We must therefore remove redundant latches and generate appropriate timing for the remaining latch elements. This removal of latches is an ad-hoc process, in which we attempted to minimize the number of latches while maintaining the transition-retaining barrier effect.
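The ad-hoc cleanup can be approximated by a greedy sketch (the path lists and latch names below are hypothetical; the actual removal in this work was done by hand):

```python
# Greedy redundant-latch pruning: drop any latch whose removal still leaves
# every input-to-output path blocked by at least one remaining latch, so
# the transition-retaining barrier stays intact.

def prune_latches(paths, latches):
    kept = set(latches)
    for cand in sorted(latches):
        trial = kept - {cand}
        if all(any(node in trial for node in path) for path in paths):
            kept = trial        # cand was redundant; barrier still intact
    return kept

# A reconvergent path crosses both L1 and L2, so L1 can be dropped.
paths = [["a", "L1", "L2", "x"], ["b", "L2", "y"]]
assert prune_latches(paths, {"L1", "L2"}) == {"L2"}
```

A greedy pass like this minimizes latch count only heuristically, which matches the text's characterization of the process as ad hoc.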
Timing issues
Inserting latches in the PP reduction stage creates further difficulties for timing. First, note that we have more latches to trigger. This will cause more loading on the delay elements which drive the latches, and will create a very complicated timing generation problem.
Secondly, since the latches are present at a shallower logic depth, the times for triggering the various latches are closer together (in time). Therefore, it is more likely that a delay element will have to drive more latches. Again, this causes load-dependent delay variations in our timing chain.
Thirdly, we must now consider timing to reduce glitching in the partial product reduction stage as well as in the adder. For carry-select blocks, timing requirements of the adder conflict with the timing requirements of the partial product reduction stage. Adders desire a “cascading” timing pattern, so that their inputs are made valid when the carry becomes valid. PP reduction stages want their inputs to be made valid when their neighbors’ inputs are valid. It is not clear which of these two effects should have precedence over the other. We attempted to generate timing for both these cases, to determine whether one dominates over the other.
4.4 Experiments
In this section, we describe results for several experiments using latch insertion in Wallace trees. As we mentioned in the introduction, it turns out that the power reductions achieved in the Wallace tree were quite small when compared to the cost of adding latches and timing signals. We will first describe our experiments and then offer some analysis.
We initially ran latch insertion experiments on 8-bit multipliers, but power dissipation results were consistently 20% worse than the base multiplier without any latches. Since false switching tends to increase as multiplier size grows, we ran all other experiments at 16 bits. For comparison, Musoll and Cortadella [45] achieved 5% worse power for TRBs at 8 bits, but latches yielded a 7% power improvement over the base multiplier at 16 bits. Note also that
[45] includes physical design characteristics, while we have not incorporated these effects,
which would tend to make latched multiplier power dissipation worse.
4.4.1 Experiment: Latch Placement on the Wallace Tree/Final Adder Boundary
Since we are looking to trigger latches at the boundary of the PP reduction stage and the final adder, the timing signals were connected so as to generate a “cascade” effect for the latches on the final adder, as mentioned above. Latches were placed all across the final adder, as shown in Fig. 35. In all, 54 latches were placed, along with 30 delay elements (60 back-to-back inverters).
Figure 35. Placement of latches on the Wallace tree/final adder boundary (16-bit multiplicand and multiplier, 256 partial product generators).
We connected the timing signals to the latches by hand and attempted to minimize power dissipation by varying the timing of the latches to minimize false switching in the adder. In these experiments, we used latches without the back-to-back inverters for “state retention”, to minimize latch power. We were able to reduce the power dissipated in the adder, but the power dissipated in the delay chain and the latches was in all cases greater than the savings. For a 16-bit Wallace tree with carry-select adder, the lowest energy design that we arrived at was 1% worse than the base case. Furthermore, the delay caused by latch insertion was 5-10% worse than the base case.
We investigated several other alternatives. A characteristic of arithmetic CMOS devices is that false switching is worse at higher order bit positions than at lower order bit positions. We tried removing latches at the lower order bit positions, since they are not removing a great deal of false switching. The result was increased power dissipation (around 5% worse than the base case). Removing latches at lower bit positions removes capacitance, but one still needs to keep the entire delay chain, as the high order bit positions receive timing signals that are highly delayed.
In conclusion, it appears that although adder switching reductions can be obtained by inserting latches at the Wallace tree/final adder boundary, the power reductions are minimal when compared to the cost of inserting extra devices. We next investigated putting
the latches inside the Wallace tree itself.
4.4.2 Experiment: Placing Latches in the Wallace Tree
We attempted to place latches in various configurations, to reduce glitching while minimizing the number of extra latches involved. Moving the latches to various parts of the PP reduction tree required a complete resynthesis of the timing chain, to re-establish correct delay driving.
In the first case, we placed latches at "one level up" from the Wallace tree/final adder
boundary. Since this places redundant latches in the signal path, as described in Fig. 34, we
Figure 36. Placement of latches within the Wallace tree.
removed some of these latches by hand. Timing was also performed by hand, and several timing configurations were tried. Following removal of redundant latches, we ended up with 93 latches and 20 delay elements.
In this experiment, we were never able to achieve a power reduction with respect to the base Wallace tree multiplier without latches. In the best case, our design’s power dissipation was 10% worse than the base multiplier design, again with approximately a 10% delay penalty.
Similar to the analysis of [45], we tried placing latches at higher levels up in the Wallace tree. In all cases, power dissipation was over 15% worse than the base case. We did not observe that reduction of false switching at one "level" causes a great reduction of false switching at subsequent depths of logic, as was the case in arrays.
As in the previous experiments, we attempted to remove latches at low order bit positions within the Wallace tree. Again, this did not improve the power dissipation characteristics of this design.
4.4.3 Conclusions
Our best result achieved was to nearly break even when inserting latches at the boundary of the Wallace tree/final adder, for a 16-bit multiplier. That is, the power consumed by the extra circuitry (latches and delay chain inverters) was nearly offset by the power removed
through the reduction of false switching in the array. Furthermore, note that this did not
include layout details (e.g., interconnect capacitance), which means that real implementa-
tions of latched multipliers probably have even worse power dissipation characteristics than
those determined by the above experiments.
Latch insertion for Wallace trees does not seem to generate power savings. Although success has been obtained when using latches to eliminate false switching in arrays, a similar result was not achievable in Wallace trees. We believe that this is due to differences in the structure of the two types of design.
First, the Wallace tree has a very wide signal path. This width, which is constant in array
designs, starts off very wide at low logic depth but decreases logarithmically until the final
adder is reached. Even at the PP reduction/adder interface, this width is greater than the
width of the array. This means that more latches are necessary to create a transition retaining
barrier. If the designer wishes to create a TRB in the PP reduction stage, a very large number
of latches is needed.
Placement of the latches is unclear, due to the lack of clearly defined “levels” as exist in arrays. Since many paths reconverge in multipliers, and the logic depths of these paths are not equal, it is possible to have a path go through several latches. In this case, the timing of the latches is complicated, and since each latch adds delay to the signal path, the total delay
may become very large.
Timing of the latches using a delay chain is straightforward in array designs, since the
latches are triggered at the same time. Wallace tree designs require latches to be timed at dif-
ferent intervals, and this causes extra load on the delay chain, which impacts the timing itself.
Furthermore, the extra number of latches present in these designs means the overall loads can
be quite large. Overcoming the load-induced delay effects may require buffering, which will
adversely impact power.
4.5 Conclusions
In this work, we had originally intended to explore by hand various ‘corners’ in the design space of latch insertion for power reduction, and then to use the initial empirical designs to decide where to focus a more general automated latch insertion methodology. Unfortunately, none of the options we tried—latch insertion at the Wallace tree/final adder boundary, latch insertion at shallower logic depth in the Wallace tree—yielded any encouraging power reductions. Therefore, we abandoned the idea of developing an automated latch-based power optimizing system. A basic result of these investigations is that although power reductions are achieved through latch insertion, the cost of adding these extra elements to Wallace trees is too expensive when compared to the small amount of switching that is eliminated. This suggests that removal of circuitry may be the best approach to power reduction in Wallace trees; this is the subject of the following chapter.
It is unclear whether there are any circumstances under which Wallace trees lend themselves to latch-based power reduction techniques. For arrays, it can be argued that false switching is so bad and the amount of power to be reduced is so great that many techniques stand a good chance of achieving gains in these designs. The main reason for array inefficiency is the great discrepancy in logic depths that signal paths traverse before arriving at the same logic block. This is not a major concern in Wallace trees.
Another, more subtle advantage inherent in Wallace trees comes from their relatively shallow logic depth. Unequal delays occur in every logic circuit for reasons other than logic depth, e.g., high-delay gates and input-dependent delay. Generally, for deeper logic depths, the cumulative effect of these delays is magnified; since Wallace trees have shallower logic depth, these effects are not as important as they are in arrays.
Based on the above characteristics, we believe that latch insertion is not a viable power reduction technique in Wallace trees. In the previous chapter, we showed that Wallace trees are superior in terms of power dissipation in comparison to arrays. Latches achieve power reduction by lessening false switching; it is therefore not surprising that they are more effective in array designs. Although false switching is present in Wallace trees, it is the activity*Capacitance product which is problematic, as Wallace trees have long interconnect lines which have high capacitance. Therefore, schemes which reduce capacitance in Wallace trees should be more effective in reducing power.
5 Minimizing Power Via Inverse Polarity Optimization
5.1 Introduction
Latch insertion fails to yield power savings when applied to Wallace trees because
although a great deal of false switching occurs in Wallace trees, this false switching is a
result of several different effects (unequal logic depth, input-dependent delays, etc.). There is no single effect which generates a large amount of false switching (as successive logic-depth skew does in arrays), nor is there one problematic sub-block which makes an easy target for optimization. In Wallace trees, a more useful approach might be to make a small modification
which is pervasive throughout the tree; although false switching cannot be eliminated, its
effects may be reduced.
In this section, we investigate the possibility of removing some circuitry from the basic
adder blocks of the multiplier, while maintaining the adder functionality. Ultimately, this is
the goal of logic synthesis: to use the minimum gate count necessary to implement a function.
Such a goal works towards minimizing the area required for the design, which in turn has ben-
eficial effects at the physical level (smaller die area, shorter interconnect, etc.) We consider a
technique which has been developed for delay optimization of adders, and we try to adapt it
for power optimization of multipliers.
5.2 Polarity Inversion
The technique which we term “inverse polarity” attempts to remove unnecessary inverters from the Wallace tree during partial product reduction. It has previously been applied to arithmetic logic as a delay reduction technique [25]. We illustrate the idea with a simple example. Inverse polarity is applied to ripple adders in the following manner: two numbers, A = [a0 a1 ... an-1] and B = [b0 b1 ... bn-1], are added to form C = [c0 c1 ... cn-1]. Each carry-save adder performs two functions: calculating the sum and calculating the carry. The carry sub-block of the full adder consists of two logic stages: the inverted-carry (carry′) stage and the inverter stage (Fig. 37a). For the ripple adder as a whole, the delay in terms of the number of logic stages is 2n. The inverse polarity technique attempts to remove the inverter stage by complementing the input bits at every other bit position. In this manner, an adder of the form in Fig.
37b results, and the delay in terms of logic stages is n. Since the inverter has been removed, there may be a loss in drive strength when driving the input loads of the next stage, and these gates may need to be upsized. However, since the logic depth has been cut
in half, the overall delay may be significantly reduced.
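The correctness of the alternating-complement trick rests on the self-duality of the majority (carry) function: maj(a′, b′, c′) = maj(a, b, c)′. The following sketch is our own illustration, not code from the dissertation; it models the carry chain only and checks the construction exhaustively for small operands:

```python
from itertools import product

def maj(a, b, c):
    # carry function of a full adder: majority of the three inputs
    return (a & b) | (a & c) | (b & c)

def inv_carry(a, b, c):
    # inverted-carry stage of Fig. 37: the carry gate WITHOUT its output inverter
    return 1 - maj(a, b, c)

def carry_out_conventional(A, B, n):
    c = 0
    for i in range(n):
        c = maj((A >> i) & 1, (B >> i) & 1, c)   # inverted-carry stage + inverter
    return c

def carry_out_inverse_polarity(A, B, n):
    c = 0   # true (POS) carry into bit 0
    for i in range(n):
        a, b = (A >> i) & 1, (B >> i) & 1
        if i % 2 == 0:
            c = inv_carry(a, b, c)            # POS inputs -> NEG carry wire
        else:
            c = inv_carry(1 - a, 1 - b, c)    # complemented inputs -> POS carry wire
    return c if n % 2 == 0 else 1 - c         # undo final wire polarity if NEG

# exhaustive check: the two carry chains agree for all 4-bit operands
for A, B in product(range(16), repeat=2):
    assert carry_out_conventional(A, B, 4) == carry_out_inverse_polarity(A, B, 4)
print("inverter-free alternating-polarity carry chain verified")
```

Because each odd-indexed stage receives a complemented carry and complemented operand bits, self-duality restores the true carry value, so no stage needs an output inverter.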
Given that this removal of inverters does not change the logic function, we may use this
new configuration to apply another optimization. If devices retain the same size, a power
reduction may be obtained simply because the number of transistors implementing the adder
has been decreased. Note that in the example of Fig. 37, the actual number of inverters has not been reduced, due to the need to invert the input values at every other stage. However, in practical cases, it may not be necessary to invert these values, and the
transistor count may then go down. If the gates are not upsized, it is not clear a priori what
will happen to the delay. Depending on the gate driving strength and associated capacitive
load, it may go up or down.
In multipliers, the majority of CMOS gates are found in the partial product (PP) reduction
stage. Closer analysis of the PP reduction network shows that the operation resembles a two-
dimensional ripple addition, with carries propagating to higher order bit positions, and sums
remaining at the same bit order (Fig. 37c). This insight leads to the idea of applying the inverse
polarity technique to the PP reduction stage.
In the following sections, we will show how inverse polarity circuits can be used in Wallace tree multipliers to lower power dissipation. We describe inverse polarity circuitry and develop a heuristic for implementing a Wallace tree PP reduction circuit using polarity inversion. We examine the effects of the inverse polarity optimization on delay. Additional effects, such as area and parasitic capacitance reduction, are also investigated. In this discussion, we denote the carry design with output inverter (Fig. 37a) as the ’conventional’ implementation.
Figure 37. Inverse polarity optimization: (a) conventional ripple adder; (b) inverted polarity version; (c) multiplier PPA structure (array).
5.3 Design Issues for Polarity Inversion
Digital CMOS logic design focuses on implementing a logic function while minimizing transistor count, subject to delay and power constraints. To this end, many CMOS logic gates are designed to drive large interconnect lines by using a two-stage structure: the first stage implements the logic, and the second stage consists of a buffer (i.e., an inverter) to drive the output capacitance. In this manner, the transistors implementing the logic function may be kept small while the buffer transistors can be made large, resulting in lower total transistor size.
While the logic stage/buffer structure provides strong drive for large loads, multiplier circuits have the interesting characteristic that large net capacitances are not commonly encountered. With a few exceptions, most connections are 2-point nets, and if proper placement can ensure that connected components are located fairly close together, resulting in short wires, the effects of low drive on delay can be minimized. Therefore, other than inside the partial product generating cells and in some of the final adder cells, the advantage of a
buffer structure is minimal. This suggests that removal of output inverters may realize lower
power dissipation with minor effects on delay. If a logically equivalent implementation using
fewer buffers can be assembled, the number of switching transistors will be reduced and there
should be a corresponding decrease in power dissipation. We apply this technique to Wallace
tree multipliers to determine resultant power savings.
5.3.1 Adder Circuit Designs
The fundamental building block in digital multipliers is the full adder, used in the partial product reduction phase to perform carry-save addition (it is therefore sometimes called a carry-save adder, or CSA). The CSA takes three inputs and calculates two outputs, sum and carry. The most commonly used implementation (modified from [1]) is shown again in Fig. 38. As mentioned previously, this implementation has 28 transistors, hence the designation “28T”. Note that this circuit incorporates the logic stage/buffer structure which is beneficial when driving large output capacitances.
The inverse polarity paradigm identifies bits as one of two polarities—that is, bits will represent the results of additions, i.e., sum and carry (positive polarity—POS), or their complements sum′ and carry′ (negative polarity—NEG). Inverted polarity circuits require that an
adder with POS inputs provide NEG outputs and vice-versa. Therefore:
CSA_IP(a,b,c) = (sum′, carry′)
CSA_IP(a′,b′,c′) = (sum, carry).
Here, the ’IP’ subscript denotes inverse polarity cells. The 28T implementation of the CSA can be transformed into a CSA_IP by simple removal of the inverters—this is because a complement of the circuit inputs to a CSA yields sum′ and carry′. The half adder (HA), on the other hand, cannot be so easily constructed (see Fig. 39). For half adders, again, we want:
HA_IP(a,b) = (sum′, carry′)
HA_IP(a′,b′) = (sum, carry).
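These identities are easy to confirm by brute force. The sketch below is our own illustration (not from the dissertation); it checks that the full adder is self-dual, so complementing its inputs complements both outputs, while the half adder is not:

```python
from itertools import product

def csa(a, b, c):
    # full adder: (sum, carry)
    return (a ^ b ^ c, (a & b) | (a & c) | (b & c))

def ha(a, b):
    # half adder: (sum, carry)
    return (a ^ b, a & b)

# Full adder: CSA(a', b', c') == (sum', carry') for every input combination,
# which is why stripping the 28T cell's output inverters yields a valid CSA_IP.
for a, b, c in product((0, 1), repeat=3):
    s, cy = csa(a, b, c)
    assert csa(1 - a, 1 - b, 1 - c) == (1 - s, 1 - cy)

# Half adder: the same identity fails, e.g. for a = b = 0, so dedicated
# HA_IP cells (Fig. 40) are needed.
s, cy = ha(0, 0)                       # (0, 0)
assert ha(1, 1) != (1 - s, 1 - cy)     # ha(1, 1) is (0, 1), not (1, 1)
print("CSA is self-dual; HA is not")
```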
Figure 38. The 28T full adder (CSA) implementation.
If the inputs to an HA are complemented, the resulting carry signal is not the complement of the original carry signal (see Fig. 39). Therefore, two versions of the HA_IP are required—one for POS inputs which yields sum′ and carry′, and another for NEG inputs that gives sum and carry. These are implemented using the circuits shown in Fig. 40.
Figure 39. Conventional implementation vs. inverse polarity equivalence: (a) for full adders, CSA(a′,b′,c′) = (sum′, carry′); (b) but for half adders, HA(a′,b′) ≠ (sum′, carry′).
Therefore, although the CSA can be easily transformed into an inverse polarity version by simply removing the output inverters, the HA cells need to be redesigned when implementing some inverse polarity versions.
5.3.2 Partial Product Reduction
A greedy heuristic was proposed by Fadavi-Ardekani in [33] to construct a Wallace-style
partial product reduction tree while minimizing logic depth. In this method, a priority queue
stores the bits for each column of the trapezoidal PP array, ordered by the largest static delay
time of the bit.
The algorithm proceeds on a column-by-column basis, starting at the lowest bit position,
where the earliest arrival bits are added using a CSA or HA; the resulting sum bit goes into
Figure 40. HA_IP designs: (a) POS inputs, NEG outputs; (b) NEG inputs, POS outputs.
the priority queue of the current column, and the carry bit is placed in the queue of the next column (see Figure 41). We can modify this basic scheme to construct inverse polarity based multipliers. To use inverted polarity elements, we first require that all the inputs to a gate be of the same polarity, either POS or NEG. In array multipliers, this can be achieved fairly easily, since each logic level of adders can be of opposite polarity. In Wallace trees, however, some adders’ inputs come from signals of different logic levels (see Fig. 42). To create equal-polarity inputs, we must also track bit polarity, and in some cases inverters must be inserted to complement input bits.
Figure 41. Fadavi-Ardekani algorithm: (a) trapezoidal PP array — all bits in column n to be added, plus carry-in bits from column n-1; (b) bits are put in a priority queue (one queue for each column), and FAs are applied to the earliest arriving bits, yielding (c) two bit vectors to be added by the final adder.
5.3.3 Inverse Polarity Wallace Tree Algorithm
Our goal is to minimize the number of inverters which need to be added to equalize input bit polarity. This can be achieved by noting that when there is a large number of bits, we will always be able to find a usable number of bits of the same polarity. In
particular, we wish to use 3-input CSA blocks to perform PP reduction. (We can also use 2-
input HA blocks, but using such blocks when CSAs could be used is wasteful.) If we are
using 3-input carry-save adders, if there are more than 4 bits present whose polarity is either
POS or NEG, we are guaranteed to be able to find 3 bits of the same polarity. Therefore,
inverters do not need to be used until we have reduced a column to 4 bits or less. At this
point, inverters may be inserted, if needed, to equalize bit polarities prior to addition.
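The guarantee above is a simple pigeonhole argument, easy to confirm exhaustively (an illustrative sketch of our own, not from the dissertation):

```python
# Pigeonhole check: whenever a column holds at least 5 bits split between
# the POS and NEG queues, some polarity can supply all 3 CSA inputs.
def can_pick_csa(n_pos, n_neg):
    return n_pos >= 3 or n_neg >= 3

for total in range(5, 12):
    for n_pos in range(total + 1):
        assert can_pick_csa(n_pos, total - n_pos)

# With only 4 bits, a 2/2 split defeats the guarantee, which is exactly
# where the STOP procedure may need inverters.
assert not can_pick_csa(2, 2)
print("every column with more than 4 bits admits an inverter-free CSA")
```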
Figure 42. Inverse polarity: (a) arrays have a regular alternating structure; (b) Wallace trees have connections which “skip” a level, causing polarity conflicts at subsequent adder inputs.
The inverse polarity assembly starts in the partial product generation circuit—here the AND gates can be replaced by NAND gates. The only effect this has is to change all of the PP bits to a NEG polarity. We are doing two’s complement multiplication using the ’sign generate’ method to precompute the ’1’ bits used for sign extension [26].
The detailed algorithm appears in Fig. 43. Given the inputs to the Wallace tree from the partial product generation stage, we wish to ’tile’ or ’cover’ the bits to be added by carry-save adders (i.e., CSAs and HAs), as described by Dadda [24]. To assemble the inverse polarity multiplier (the TILE procedure), we provide two priority queues for each column, one for POS and one for NEG bits. Bits of the same polarity are selected from the queue with the lowest-delay bit. This procedure iterates until the total number of bits in both queues is less than 5.
In the STOP procedure, we attempt to reduce the number of bits down to 2. If there are originally 4 bits remaining, we use a CSA; otherwise (there are 3 bits remaining), we use an HA. When there are 3 or 4 bits, we cannot guarantee that the inputs are of the same polarity. In this case, we use inverters to normalize all bits to the same polarity.
Since this greedy assembly algorithm uses the lowest-delay bits each time it instantiates an adder, the procedure heuristically minimizes the growth of the maximum delay per column. The stopping condition is the only point where extra inverters are inserted, for the sole pur-
128 Analysis and Design of Low Power Multipliers
Design Issues for Polarity Inversion
Procedure TILE(column i) {
    /* bits arrive in queues POSi and NEGi, ordered by delay */
    while ( #bits{POSi} + #bits{NEGi} > 4 ) {
        /* pick the queue to work on */
        if (earliest{POSi} < earliest{NEGi})
            choose_Qi = POSi;
        else
            choose_Qi = NEGi;
        instantiate an IP adder on choose_Qi;
        add the earliest bits of choose_Qi;
        put the output bits on the opposite-polarity queues of
            columns i and i+1;   /* IP adder outputs flip polarity */
    }
    STOP( #bits{POSi} + #bits{NEGi} );
}

Procedure STOP(total) {
    switch (total):
    case 3:
        if (#bits{POS} >= 2) use HA_IP on POS bits;
        if (#bits{NEG} >= 2) use HA_IP on NEG bits;
    case 4:
        if (#bits{POS} >= 3) use CSA_IP on POS bits;
        if (#bits{NEG} >= 3) use CSA_IP on NEG bits;
        if (#bits{POS} == 2 and #bits{NEG} == 2) {
            find the latest bit (ltbit = POS or NEG);
            use CSA_IP on bits of type ltbit
                (use inverters to equalize inputs);
        }
}
Figure 43. Basic inverted polarity CSA tiling algorithm.
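As a concrete illustration, the TILE/STOP procedure can be sketched in Python. This is our own toy model, not the dissertation's implementation: bit arrival times stand in for static delays, every adder costs one delay unit, and the STOP phase is simplified to normalize its few remaining bits with "free" inverters:

```python
import heapq

CSA_DELAY = 1.0   # one adder stage per reduction level (arbitrary unit)

def tile_column(queues, i):
    """queues[i] = {'POS': heap, 'NEG': heap} of bit arrival times."""
    pos, neg = queues[i]['POS'], queues[i]['NEG']
    # TILE: while more than 4 bits remain, a CSA_IP can always find
    # 3 same-polarity inputs, so no inverters are needed yet.
    while len(pos) + len(neg) > 4:
        q, pol = (pos, 'POS') if (pos and (not neg or pos[0] < neg[0])) else (neg, 'NEG')
        if len(q) < 3:                                # earliest queue too small:
            q, pol = (neg, 'NEG') if pol == 'POS' else (pos, 'POS')
        ins = [heapq.heappop(q) for _ in range(3)]    # 3 earliest bits
        t_out = max(ins) + CSA_DELAY
        opp = 'NEG' if pol == 'POS' else 'POS'        # IP outputs flip polarity
        heapq.heappush(queues[i][opp], t_out)         # sum stays in column i
        heapq.heappush(queues[i + 1][opp], t_out)     # carry moves to column i+1
    # STOP (simplified): normalize the last <= 4 bits to one polarity and
    # reduce down to the final two bits for the final adder.
    rest = sorted(pos + neg)
    while len(rest) > 2:
        ins, rest = (rest[:3], rest[3:]) if len(rest) >= 4 else (rest[:2], rest[2:])
        t_out = max(ins) + CSA_DELAY
        rest.append(t_out)                            # sum bit stays here
        heapq.heappush(queues[i + 1]['POS'], t_out)   # carry to next column
        rest.sort()
    queues[i]['POS'], queues[i]['NEG'] = rest, []

# Toy example: one column holding 9 NEG partial-product bits at t = 0.
queues = [{'POS': [], 'NEG': [0.0] * 9}, {'POS': [], 'NEG': []}]
tile_column(queues, 0)
print(len(queues[0]['POS']))   # -> 2 (the column's final bit pair)
```

A real implementation would also track each bit's polarity into the final adder and account for inverter delay at the stopping condition; the sketch only shows the greedy queue discipline.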
pose of equalizing the input polarities of an adder. Once the partial product array has been completely reduced, two bits are present at each bit position. Note that the polarities of these final bits may well be mixed, i.e., at any given column, the final two bits may be both POS, both NEG, or one POS and one NEG. At this point, a final adder is created to generate the final result. This adder may also need an inverter at some of its inputs, to equalize the polarity of these inputs.
Note that even though extra inverters are ’inserted’, the total inverter count will always be less than in the conventional multiplier. In inverse polarity construction algorithms, all inverters are initially removed from the design, then a few are selectively ’put back’. The net effect is to reduce transistor count.
5.3.4 Physical Design Issues
In Wallace trees, interconnect capacitance is a decisive factor in determining the delay and power characteristics of a multiplier. In comparing conventional versus inverse polarity designs, the output loading of each adder cell is also important in determining these characteristics. In particular, delay values for inverse polarity designs are highly dependent on output load. For example, if the output load is negligible, inverse polarity implementations may be faster, due to reduced logic depth. On the other hand, if the output load is significant, inverse polarity circuits will have higher delay, because with their output inverters omitted, they have relatively weak drive strength.
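This load dependence can be seen with a back-of-envelope RC model. All constants below are hypothetical (not from the dissertation); they only illustrate that the inverse polarity cell wins at light loads and loses at heavy ones:

```python
# Hypothetical first-order RC delay model of the buffer-removal tradeoff.
R_LOGIC = 8.0    # kOhm, assumed drive resistance of the bare logic stage
R_BUF = 2.0      # kOhm, assumed drive resistance of the output buffer
C_BUF_IN = 3.0   # fF, assumed input capacitance of the buffer

def delay_conventional(c_load):
    # logic stage charges the buffer input, then the buffer drives the load
    return R_LOGIC * C_BUF_IN + R_BUF * c_load

def delay_inverse_polarity(c_load):
    # no buffer: the weak logic stage drives the load directly
    return R_LOGIC * c_load

for c_load in (1.0, 5.0, 20.0):   # fF
    faster = delay_inverse_polarity(c_load) < delay_conventional(c_load)
    print(c_load, "IP faster" if faster else "conventional faster")
```

Under these numbers the crossover sits at C_load = R_LOGIC * C_BUF_IN / (R_LOGIC - R_BUF) = 4 fF; in real multipliers the crossover depends on the cell sizing and wire loads measured later in this chapter.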
Our placement methodology initially followed the one developed for our array versus Wallace tree experiments of Chapter 3, where we used a simple, grid-based simulated
annealing placement tool to create layouts. (Recall that we used a general placement tool to
determine the layout because Wallace trees do not easily lend themselves to a regular layout
structure.) Results for placing inverse polarity structures were disappointing in that they had
very high delay. We eventually discovered that the majority of the delay was located in the
final adder.
In theory, the final adder blocks should be placed fairly close together, resulting in very short interconnect, and should therefore not be highly influenced by the low drive of the inverse
polarity circuits. In practice, since our wirelength minimizing simulated annealing algorithm
does not distinguish among functional blocks in determining placement, the goal of close
placement for components of the final adder was lost.
To improve placement characteristics, we developed a simple two-phase layout methodology: 1) final adder blocks are arranged using a procedural placement technique; 2) simulated annealing is used to group elements in the Wallace tree which have high connectivity.
The procedural placement technique attempts to arrange the final adder blocks next to the
outputs. It first calculates the total area necessary for the final adder’s components. Then, based on the available width of the placement area, it computes how many standard cell rows are necessary for the block. Finally, the adder blocks are ’trowelled’ in until the entire final adder has been placed.
Estimates of the footprint required for each basic logic block were calculated for the MOSIS HP 0.5 µm CMOS technology. The estimates were based on the layout area necessary for an inverter, and the area for general logic blocks was calculated as 1/2 their transistor
Figure 44. Procedural assembly of the final adder: (a) calculate the width of the final adder; (b) based on the width of the placement area, determine the number of rows needed; and (c) trowel in the final adder.
count times this base area.
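The area estimate and row computation of Fig. 44 can be sketched as follows. The constants here are invented for illustration; the dissertation derives its base area from a MOSIS HP 0.5 µm inverter layout, which we do not reproduce:

```python
import math

INV_AREA_UM2 = 30.0    # assumed inverter footprint (hypothetical value)
ROW_HEIGHT_UM = 6.0    # assumed standard-cell row height (hypothetical)

def block_area(transistors):
    # area model from the text: (transistor count / 2) * inverter area
    return transistors / 2.0 * INV_AREA_UM2

def rows_needed(blocks, placement_width_um):
    """blocks: transistor counts of the final adder's cells."""
    total = sum(block_area(t) for t in blocks)
    row_capacity = placement_width_um * ROW_HEIGHT_UM   # area of one cell row
    return math.ceil(total / row_capacity)

# e.g. sixteen 28T adders plus sixteen 2T inverters in a 100 um wide area
blocks = [28] * 16 + [2] * 16
print(rows_needed(blocks, 100.0))   # -> 12
```

The 'trowelling' step then simply fills these rows with adder blocks in order until the final adder is fully placed.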
Interconnect net length was calculated using a Steiner tree construction for multipoint
nets with Manhattan distances for each segment (0.165fF/µm cap. to ground.) Delays are
calculated using static timing analysis based on an HSPICE[51] characterization of cells.
Power is calculated using Star-sim[52] from Avant! Corp., which allows fast power compu-
tation with high accuracy. As in Chapter 3, we repeatedly simulated sets of 20 vectors, so as to arrive at a 95% confidence interval with a 5% error [10]. The power numbers which we cite later in the chapter are the average values of all simulations for a given multiplier. We performed some sample runs to test the resolution of Star-sim, and found its accuracy to be within +/-2% of HSPICE runs.
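For the 2-point nets that the text notes dominate these multipliers, the Steiner tree degenerates to the Manhattan distance between the pins, so the capacitance estimate reduces to a length times the quoted 0.165 fF/µm. A small sketch of our own (using the half-perimeter bound, which is exact for 2-point nets, in place of a full Steiner construction):

```python
CAP_PER_UM_FF = 0.165   # Metal1 capacitance to ground quoted in the text

def manhattan_length_um(pins):
    """pins: list of (x, y) positions in um; half-perimeter wirelength."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def net_capacitance_ff(pins):
    # ground capacitance proportional to estimated wirelength
    return manhattan_length_um(pins) * CAP_PER_UM_FF

# a 2-point net whose pins are 40 um apart in x and 20 um in y
print(round(net_capacitance_ff([(0.0, 0.0), (40.0, 20.0)]), 3))   # -> 9.9
```

For the few multipoint nets a true rectilinear Steiner tree can be shorter than the half-perimeter bound suggests, which is one reason the extracted capacitances reported later differ from these estimates.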
5.4 Experimental Results using Layout Estimation
We designed a number of Wallace tree multipliers, one set with conventional logic and the other with inverse polarity circuits. The count of inverters present in two such sets of multipliers is shown in Table 6. These include the inverters in the partial product generation, Wallace tree, and final adder.
Table 7 and Table 8 compare energy per operation between conventional and inverse
polarity multipliers. In all cases, minimum size devices were used to minimize power con-
sumption. Results clearly indicate a power advantage for inverse polarity multipliers. A more pronounced advantage is seen in larger multipliers with carry-select adders; these have the greatest number of adder circuits, so the reduced transistor count is most beneficial in these cases.
Table 6: Count of inverters in multipliers
Multiplier Size | Conventional | Inverse Polarity
8               | 216          | 47
16              | 824          | 141
Table 7: Energy/operation for 8-bit multipliers
Final Adder Style | Conventional | Inverse Polarity | Power Reduction
Ripple            | 3.78e-11 J   | 3.32e-11 J       | 12.2%
Carry Select      | 4.93e-11 J   | 4.63e-11 J       | 6.1%
Table 8: Energy/operation for 16-bit multipliers
Final Adder Style | Conventional | Inverse Polarity | Power Reduction
Ripple            | 4.27e-9 J    | 3.49e-9 J        | 18.3%
Carry Select      | 5.45e-9 J    | 4.13e-9 J        | 24.2%
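The reduction percentages follow directly from the energy columns; a quick arithmetic check (our own post-processing of the table values):

```python
# power reduction = 1 - E_inverse_polarity / E_conventional, as a percentage
def reduction_pct(e_conv, e_ip):
    return round((1.0 - e_ip / e_conv) * 100.0, 1)

# (conventional, inverse polarity) energy per operation, in joules
table = {
    ('8-bit', 'Ripple'):        (3.78e-11, 3.32e-11),   # Table 7: 12.2%
    ('8-bit', 'Carry Select'):  (4.93e-11, 4.63e-11),   # Table 7:  6.1%
    ('16-bit', 'Ripple'):       (4.27e-9,  3.49e-9),    # Table 8: 18.3%
    ('16-bit', 'Carry Select'): (5.45e-9,  4.13e-9),    # Table 8: 24.2%
}
for key, (conv, ip) in table.items():
    print(key, reduction_pct(conv, ip))
```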
A potential source of parasitic power dissipation in inverse polarity circuits is increased
short circuit (totem pole) current due to more slowly falling/rising inputs, which result from
the inverse polarity optimization. Detailed simulations found this effect to be insignificant.
Short circuit current flows through a conducting path between the supply and ground rails. The magnitude of this current is proportional to the conductance of this path, which is a function of the transistor W/L ratios in each stack. Given that all transistors are the same size, the conductance of the inverter should be higher, since there is only one transistor in its stack, compared to 2-3 transistors per stack in the full adder block (not counting output inverters). For this reason, the short circuit power of inverters should be higher.
We simulated the switching behavior of an inverter and a full adder. The average current
per nanosecond during discharge is as shown in the graphs below:
Figure 45. Average short circuit currents: (a) full adders have lower short circuit currents than (b) inverters. (The plots distinguish short circuit current from capacitance charging current.)
This shows that an inverter has slightly higher short circuit current than the full adders. We
expect that if the inverters were upsized (to drive higher loads), their short circuit current
would be even greater.
Delay, area and interconnect characteristics are shown in Table 9. These results cover two different architectures, conventional logic and inverse polarity logic, as well as two types of final adder, ripple and carry select. Inverse polarity circuits are slightly slower than conventional circuits, by 3%-6%. Delay penalties can be eliminated by judicious use of carry-save
adders: conventional adders for the critical path and inverse polarity adders for non-critical
paths. The delay penalty is entirely due to interconnect capacitance of the Wallace tree; when
we simulated the design without interconnect, we found that the inverse polarity versions
were slightly faster. It is our belief that the Wallace tree layouts can be improved (through the
use of better placement tools.) Further refinement of the Wallace tree layout should thus
improve the performance of inverse polarity designs. Table 9 also shows area and interconnect
Table 9: Delay, area, wire cap. for 16-bit multipliers
          | WT-Ripple | IP WT-Ripple | WT-Carry Sel. | IP WT-Carry Sel.
Delay     | 38.9 ns   | 40.0 ns      | 35.0 ns       | 36.9 ns
Area      | 8586 µm2  | 7368 µm2     | 9804 µm2      | 8576 µm2
Wire Cap. | 11,872 fF | 11,234 fF    | 13,290 fF     | 12,059 fF
capacitance numbers for 16-bit multipliers. As expected, the area of implementation is smaller. Furthermore, total interconnect capacitance was less for inverse polarity designs, which is due to the smaller area of implementation: closer components require less wiring.
5.5 Experiments with Detailed Layout
The previous analysis was performed using physical design characteristics which were
estimated based on data taken from the SPICE models. We estimated the area needed for the
layout of logic elements by looking at existing standard cell libraries and extrapolating sizes
for the Hewlett Packard 0.5µm process. Wiring capacitance was determined by calculating
the capacitance of a Metal1 line to ground, which was multiplied by the length of the wire.
Based on these values, we created a layout with estimated interconnect wires, which gave a
measure of capacitive load.
5.5.1 Enhanced Methodology
To conclude this analysis, we would like to analyze our multipliers using realistic layouts. There are several important effects which might be under-represented by our earlier estimation methodology. First, the area of each cell was computed as a function of the number of transistors—although there is a correlation between transistor count and cell size, there will be a certain amount of error involved. Furthermore, we are not aware of the mag-
nitude of cell parasitic capacitances. Secondly, the wire capacitance was estimated by global wiring, which was calculated as simply the Manhattan distance between components. This will be off because detailed wiring may be longer than the Manhattan distance if the interconnect is required to take a "snake"-like path, for example due to heavy wiring congestion between two connected components. Finally, if delay is significantly altered by interconnect, this will impact power dissipation, as explained in earlier chapters: greater skew between signals at the input of components will create more false switching.
To estimate the above effects, we developed a standard cell library in the modern STMicroelectronics .25µm CMOS process, using the Cadence tool flow for detailed geometry generation. The flow, shown in Fig. 46, is as follows:
• Cell library development: we developed a standard cell library, using designs from the existing STMicroelectronics HCMOS7 standard cell library where applicable. The minimum dimensions of transistors in the ST HCMOS7 library seem to be W = .75µm, L = .25µm, so we took these dimensions to be "minimum size". There were 7 basic cells which we used to form 22 standard cells—for example, the CSA is composed of 3 types of basic cells: a carry stage, a sum stage and two inverters. Metal1 is used for local interconnection, although for high congestion cells, polysilicon or Metal2 can be used for additional routing. Metal1 is not restricted to follow a grid, but it does follow a grid where
possible (so that later routing tools can use metal1, where it is left over.) The layers used
for routing are Metal2 (vertical), Metal3 (horizontal) & Metal4 (vertical).
• Create placement: as described previously, our placement is based on a simple simulated annealing algorithm, which attempts to minimize total wirelength while optimizing (eliminating) overlap of logic blocks. In some cases, the final placements had some overlap remaining, which was fixed by hand. The placement area was based on total cell
Figure 46. Physical design verification methodology: create standard cell library → generate LEF; create design → generate DEF; Silicon Ensemble (routing) → import to Cadence → DIVA (extraction) → netlisting of capacitances → compare capacitances and rerun static timing.
size, plus a margin of about 10%. Different aspect ratios were examined to see what effect these had on the wiring length. Our resulting placement was encoded into the Cadence DEF format, which was passed to Cadence Silicon Ensemble for routing, along with a LEF description of the cell library.
• Wire with Silicon Ensemble: the routing stage was uncomplicated—using the default settings of Silicon Ensemble, we performed global and detailed routing. The routing took 1 minute for a 4-bit multiplier and 5 minutes for an 8-bit multiplier. The resulting routing was then exported to the Cadence Design Framework.
• Extract, Back Annotate, Simulate: the layout was then verified and extracted using the ST .25µm rules with the DIVA extraction program. We encountered problems with the extraction stage, as DIVA has a tendency to rename nets in the extraction process. For 4-bit multipliers, about 5 nets were renamed; for 8-bit multipliers, 30+ nets were renamed. As we were unable to determine the reason for the net renaming, this was the primary limitation on the size of multipliers which can be investigated using this methodology; all our experiments were performed on 8-bit multipliers.
Note that in the versions of the DIVA rules available to us, adjacent wiring coupling is not extracted. Therefore, the capacitance numbers are the parallel-plate capacitance of a Metal1 line to ground, plus the fringe capacitance.
5.5.2 Placement Details
Our placement stage consists of defining a placement area, along with I/O pin locations. The size of the area is determined by the total area of the components to be placed, plus some additional area, typically 5-10% extra. We experimented with different aspect ratios to determine the effect of shape on overall capacitance. Two sample multipliers are shown in Fig. 47: the layout on the top is a 4-bit conventional multiplier, and the layout on the bottom is an 8-bit inverse polarity multiplier. Note that the 4-bit multiplier has an aspect ratio of 1.375, while the 8-bit multiplier has an aspect ratio of 1.19.
Figure 47. Placement for (a) a 4-bit conventional multiplier and (b) an 8-bit inverse polarity multiplier. The final adder is shown in dotted lines.
In these layouts, the inputs are on the bottom. The lowest order output bit is on the top
left of the placement, and the highest order output bit is on the top right. The final adder
stage comprises the top two rows of the placement area. However, in some cases, the adder does not completely fill the top rows, and components from the Wallace tree are placed in the remaining empty slots (e.g., in the 4-bit multiplier of Fig. 47a). Our placement algorithm took about 10 minutes for 4-bit multipliers and up to 15 hours for 16-bit multipliers. We were nearly always able to place multipliers without overlap, although in some cases we had to resolve the overlap by hand. Clearly, a less restrictive placement area and a more sophisticated placement algorithm would result in a better layout with no overlap.
For nearly all types of multipliers, the conventional versions were harder to place than the inverse polarity multipliers; this can be attributed to the greater total area of components to be placed, which results in a greater possibility of overlap.
5.5.3 Interconnect Details
Our extraction experiments were run on conventional and inverse polarity 4-bit and 8-bit
multipliers, using various placement areas. We initially investigated large aspect ratios (lay-
out is wider than it is tall), for the reasons given below. The wiring for 8-bit multipliers can
be seen in Fig. 48.
Figure 48. Routing of 8-bit multipliers: (top) conventional 8-bit multiplier and (bottom) inverse polarity 8-bit multiplier.
Experiments with Detailed Layout
In these layouts, interconnect was routed on metals 3 and 4. Metal 1 was used for intra-cell routing, and metal 2 was used for power and ground distribution. (It may be possible to reduce congestion by allowing metal 2 to be used for routing. Currently, metal 2 is partially used in the routing, but it is under-utilized, as the power lines are routed in the "wrong way" direction.) Routing was performed using Silicon Ensemble from Cadence, initially performing a global routing phase, and then performing local routing. All routing was completed without overflows, and took about 5 minutes (both global and local routing). Extraction was performed using the Cadence DIVA tool, within the methodology of the ST 0.25 µm design flow.
The resulting interconnect capacitances are shown in Table 10. We can see that inverse polarity circuits have less overall interconnect, which is a direct consequence of their smaller circuit count. These capacitances were also back-annotated into our timing tool to determine the longest path through static timing analysis. This data confirms that our capacitance estimates are fairly accurate.

Table 10: Simulated and extracted data

                          4-bit conv.   4-bit inv. pol.   8-bit conv.   8-bit inv. pol.
Capacitance estimate      229.185 fF    222.75 fF         2072.9 fF     1637.79 fF
Static delay              3317 ps       3603 ps           7231 ps       7704 ps
Extracted capacitance     228.152 fF    201.398 fF        2248 fF       1702 fF
Delay (ext. cap. based)   3336 ps       3498 ps           7373 ps       7693 ps
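The "Capacitance estimate" row above comes from a length-proportional wiring model. A minimal sketch of that estimation step follows; the per-unit capacitance value and the pin coordinates are illustrative placeholders, not values from the ST design kit.

```python
# Sketch of the routing-estimation step: global route length is
# approximated by the Manhattan distance between connected components,
# and net capacitance is assumed proportional to that length.
C_PER_UM = 0.2  # fF per µm of wire; placeholder, not an extracted value

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def net_capacitance(pins):
    """Estimate a net's capacitance from a chain of pin-to-pin
    Manhattan segments (a simple stand-in for a real route)."""
    length = sum(manhattan(pins[i], pins[i + 1]) for i in range(len(pins) - 1))
    return length * C_PER_UM

# a 2-pin net spanning 30 µm horizontally and 10 µm vertically: 40 µm of wire
cap = net_capacitance([(0, 0), (30, 10)])  # -> 8.0 fF at 0.2 fF/µm
```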
A histogram of nets for the 8-bit inverse polarity multiplier is shown in Fig. 49, comparing the distribution of estimated net capacitance with extracted net capacitance. We see that the calculated values are very similar, with a slight tendency for the estimation to undercount the capacitance. Several factors caused the extracted capacitance to be at variance with our estimates.
Figure 49. Estimated versus extracted net capacitances for an 8-bit inverse polarity multiplier
• Underestimation: during our routing estimation stage, we determined the length of our global routes to be the Manhattan distance between components. This assumption can be wrong simply because it does not take into account congestion, which can cause wires to take a path which is longer than the basic Manhattan distance.
• Overestimation: our capacitance calculation step can also overestimate capacitance. Initially, when assuming capacitance was proportional to the length of a wire, we measured the capacitance of a metal 1 line to ground. In our methodology, much of the wiring was performed on metals 3 and 4, which have a much lower capacitance to ground. Therefore, we tend to overcalculate the capacitance of a given line.

For these experiments, we found that our capacitance estimates were fairly close to the actual extracted values of routed examples. In practice, we have seen that although a large amount of wiring is present on the multiplier, we do not see very long lines running adjacent to each other.

Of concern is the added capacitance of adjacent wires, which starts to become significant in deep submicron processes. We believe that although this effect is present in our designs, its impact will be small, for several reasons. In our designs, with few exceptions, wiring capacitance is a small amount of the overall node capacitance (node capacitance = interconnect capacitance + gate capacitance). This is because the basic building blocks that
we are using, CSAs, have inputs which each drive 6-7 transistors. Therefore, even for minimum-sized gates, the capacitance of each input is on the order of 20 fF (0.25 µm process). Long wires have capacitance on the order of 17-20 fF. Note furthermore that very few long wires are encountered, the majority being small to medium sized wires. Therefore, wiring effects do not dominate the total switched capacitance.
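Using the representative numbers above (roughly 20 fF of gate load per CSA input, 17-20 fF for the few long wires, and much less for typical ones), a quick check of the wiring fraction of node capacitance might look like the following sketch; the wire-length distribution is invented for illustration.

```python
# Rough check that wiring does not dominate switched capacitance.
# One CSA input load per net; the wire capacitance list is hypothetical:
# mostly short wires, a few medium ones, and a single long wire.
GATE_CAP_FF = 20.0  # approx. CSA input load for minimum-sized gates
wire_caps_ff = [2.0] * 8 + [5.0] * 3 + [18.0]

total_gate = GATE_CAP_FF * len(wire_caps_ff)
total_wire = sum(wire_caps_ff)
wire_fraction = total_wire / (total_gate + total_wire)
# wire_fraction is well under one half, so gate load dominates
```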
5.5.4 Aspect Ratio Details

We experimented with different aspect ratios, due to the possibility of wiring congestion in certain areas of the physical design, particularly after the partial product generation stage. Ideally, the layout of a multiplier might look like Fig. 50. This layout has been designed to optimize several characteristics. First, note that all signal flow is "top to bottom"—components which are at deeper logic depths are placed closer to the output pins. This can reduce congestion by ensuring that no signals flow "bottom to top". Next, note that all the partial product generators are clustered at the top of the layout. Since the inputs must fan out to many partial product generators, clustering these at the top helps reduce interconnect length.
A problem occurs if we consider the number of connections which must cross the boundary between the PPA generators and the Wallace tree. In general, this number will be O(n²) for an n-bit multiplier—for large multipliers, we can expect high congestion at this interface1.
There are several ways to alleviate such congestion. First, we can avoid segregating the PPA generators and the Wallace tree adders, and instead put some PPA generators in the Wallace tree.
1. At this point, we see one of the advantages of Booth recoding, which can cut the number of PPA bits in half. For this given layout scheme, the number of wires crossing this interface is also cut in half.
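The O(n²) crossing count, and the halving from Booth recoding noted in the footnote, can be illustrated with a simple count. Treating every partial-product bit as one crossing wire is an idealization:

```python
import math

def crossing_wires(n, booth=False):
    """Number of partial-product bits crossing from the PP generators
    into the Wallace tree for an n-bit multiplier, modeled as
    rows x n bits. Radix-4 Booth recoding roughly halves the rows."""
    rows = math.ceil(n / 2) if booth else n
    return rows * n

plain = crossing_wires(16)          # 16 rows x 16 bits = 256 wires
boothed = crossing_wires(16, True)  # 8 rows x 16 bits = 128 wires
```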
Figure 50. Physical design of Wallace tree multiplier: (a) logical structure, (b) idealized physical layout.
This approach eliminates the "top-to-bottom" signal flow, and may lead to excessive wire length. Second, with multiple-metal-layer technologies, we are not restricted to one plane pair—these signals can be distributed across multiple levels. In the extreme case, we can go away from the HVHV routing paradigm, using some of the horizontal layers to route signals in the vertical direction. Finally, we can increase the aspect ratio of the layout, using footprints which are wider than they are tall. For these layouts, more wiring capacity will be available in the vertical direction.

Our initial designs had aspect ratios slightly greater than '1', resulting in designs that were somewhat wider than tall. However, in practice we do not expect wiring congestion to be that great of a problem in DSP-sized multipliers (that is, small 8-16 bit multipliers), certainly not on the scale present in microprocessor floating point multipliers. Therefore, it is important to see if such aspect ratios, although desirable for congestion reasons, may have adverse effects on wiring capacitance. We then attempted to determine the impact of using wider aspect ratios.

We laid out two versions of our 8-bit multiplier using narrower and taller layouts, and determined wiring capacitance. The wiring of these designs is shown in Fig. 51. Although the aspect ratio is close to '1', there were no routing problems, and all nets were completed successfully.
Figure 51. Routing of 8-bit multipliers, aspect ratio ~1: (a) conventional 8-bit multiplier and (b) inverse polarity 8-bit multiplier.
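The benefit of a squarer floorplan has a simple geometric explanation: for two pins placed uniformly at random in a W x H region, the expected Manhattan separation is (W + H)/3, which for a fixed area is minimized when W = H. A quick sketch, using the 8-bit inverse polarity layout dimensions reported below in Table 11 (the uniform-pin model is an idealization):

```python
def avg_manhattan(width, height):
    """Expected Manhattan distance between two uniformly random points
    in a width x height rectangle: E|x1 - x2| = width / 3, and likewise
    for the y coordinates, so the total is (width + height) / 3."""
    return (width + height) / 3.0

# 8-bit inverse polarity layouts (µm)
wide = avg_manhattan(160, 100)    # aspect ratio 1.60
square = avg_manhattan(120, 118)  # aspect ratio ~1.02
# the squarer floorplan yields shorter expected wires, hence less capacitance
```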
The impact of aspect ratio on wiring capacitance is illustrated in Table 11. For these designs, the height was fixed and the width was allowed to vary. The height measurement consists of 9 µm per standard cell row, plus a 10 µm total offset for the pins at the top and bottom of the layout. We aimed for an aspect ratio of '1' for the inverse polarity multiplier—the conventional version was slightly wider. As can be seen from the results, a more square aspect ratio results in shorter wires and less capacitance. This conclusion is similar to that arrived at when placing general logic blocks.

Table 11: Aspect ratio and capacitance

                       8-bit conv.    8-bit inv. pol.   8-bit conv.    8-bit inv. pol.
                       version 1      version 1         version 2      version 2
Layout area            180 x 100 µm   160 x 100 µm      140 x 118 µm   120 x 118 µm
Aspect ratio           1.80           1.60              1.19           1.02
Capacitance estimate   2072.9 fF      1702.8 fF         1754.28 fF     1521.14 fF

Use of inverse polarity circuits provides certain advantages in multiplier construction, but they must be applied judiciously, as their use impacts certain characteristics, most notably delay. We will discuss some of the reasons why delay is an issue in inverse polarity circuits, then talk briefly about noise optimization issues.

5.6 Additional Design Considerations

In creating and simulating inverse polarity designs, we were able to discover several characteristics about the inverse polarity technique, the way it was assembled, and its physical design characteristics. In addition to providing power reduction, the inverse polarity design technique is a low-logic-depth, low-drive technology. There are interesting implications for circuit delay, as well as noise properties (signals coupling onto wires from adjacent nets).
5.6.1 Wiring Delay

The delay of a circuit depends on logic depth, drive strength, and load capacitance. In using inverse polarity circuits, logic depth is decreased, drive strength is decreased, and load capacitance (seen at the output of adders) remains fairly constant, although there is a slight tendency for it to decrease, as smaller die area leads to shorter interconnect, which generally leads to lower interconnect capacitance. Although the reduction in capacitance is generally true, capacitance is also a function of the presence of adjacent wires, especially in deep submicron designs. Therefore, it is uncertain whether the reduction in die area will lead to significant capacitance reduction.

In attempting to optimize delay, we used the same approach for both conventional and inverse polarity designs, so as not to favor one over the other. Although the above-mentioned results show a slight delay penalty in using inverse polarity circuits, again, we believe that further refinement of the design flow will yield reductions in delay—since the inverse polarity circuits have a greater portion of their delay due to interconnect, we believe that interconnect reduction techniques are more likely to yield results in inverse polarity circuits than in conventional circuits.

Another interesting characteristic of inverse polarity designs is their smaller logic depth.
If gate upsizing is applied to logic circuits, delay is reduced up to a certain point, beyond
which gate upsizing fails to reduce the delay. At this point, the delay of a logic function is
closely related to its logic depth; therefore, under maximum gate upsizing, inverse polarity
designs should be faster. It remains unclear whether gate upsizing ever reaches this regime or
whether this is simply an ’extreme-case’ effect.
5.6.2 'Logic-based' Delay

Our PP reduction tree assembly algorithm was based on the work of Ardekani [33], who presented a simple, efficient heuristic for assembling a Wallace tree while minimizing delay. The algorithm proceeds on a column-by-column basis, from bit position 0 to the highest bit position. In each column, CSA blocks are used to reduce all the bits down to two, which will be the inputs to the final adder. (The operation was described in Fig. 41.)

Our algorithm is very similar, with the exception that we use two priority queues to store the bit arrival times, so as to keep track of POS and NEG input bits separately. Ardekani also does not take into account input-dependent delay information (different inputs to the same function have different delays). There is a problem in using this kind of assembly algorithm for inverse polarity. Consider the case of Fig. 41b. Here we see that a CSA has been applied to reduce three bits in a column; the sum bit is placed back in the priority queue for this column, while the carry bit is placed in a priority queue for the next column.
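A minimal sketch of this column-by-column reduction follows. For brevity it uses a single queue per column and a uniform unit CSA delay; our actual tool keeps separate POS and NEG queues and uses input-dependent delays.

```python
import heapq

CSA_DELAY = 1.0  # illustrative unit delay per carry-save adder

def reduce_columns(columns):
    """Ardekani-style reduction: in each column, repeatedly combine the
    three earliest-arriving bits with a CSA; the sum bit goes back into
    the same column's queue and the carry bit into the next column's
    queue, until every column holds at most two bits."""
    cols = [list(c) for c in columns]
    for c in cols:
        heapq.heapify(c)
    i = 0
    while i < len(cols):
        col = cols[i]
        while len(col) > 2:
            if i + 1 == len(cols):
                cols.append([])  # carries may extend past the last column
            a, b, c3 = (heapq.heappop(col) for _ in range(3))
            t = max(a, b, c3) + CSA_DELAY   # output ready after slowest input
            heapq.heappush(col, t)          # sum stays in column i
            heapq.heappush(cols[i + 1], t)  # carry moves to column i + 1
        i += 1
    return cols

# four bits arriving at t=0 in one column reduce to two after one CSA,
# with one carry pushed into the next column
cols = reduce_columns([[0.0, 0.0, 0.0, 0.0]])
```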
When the algorithm is applied to the next column, this carry bit will be connected to the input of a CSA. This input will add extra load to the output of the carry logic, which will affect the delay of the sum bit in the previous column (Fig. 52c). In the conventional case, an inverter is present on the output of the carry, and this prevents loading of the previous column (Fig. 52b).

Figure 52. Inverse polarity loading effects: (a) use of full adder in PP reduction; (b) circuit implementation of (a) using conventional full adders; (c) implementation using inverse polarity full adder—note that the load on the carry output affects the delay of the sum output.

This type of loading can have a significant effect on the overall delay of the multiplier. If several of these types of loading are present in a critical path, the effect can be severe—the effect is less important if these types of loading are distributed among several non-interacting paths. This is especially problematic if the carry output of an inverse polarity circuit is connected to a long interconnect stage. The effect of such loading causes delays in the previous column to grow. This means that the optimal mapping for minimum delay, which was the goal of our algorithm, will be thrown off, and the resulting delays will be greater than expected.

To counter this effect, a more complex PP reduction assembly scheme may be desirable. For example, the assembly algorithm could make several passes, re-mapping certain connections which have excessive load. In some cases, delay can be improved at the expense of power—an inverter can be placed on those carry outputs which are particularly problematic. Finally, in the physical design stage, one should try to minimize the loading of such carry outputs. For example, in the placement stage, one could cluster some components together to reduce interconnect, and then place these clusters instead of placing each component separately.

5.6.3 Noise Issues

An interesting and potentially useful characteristic of inverted polarity circuits has to do with longer rise- and fall-times (switching times). Since buffer removal implies that outputs will have shallower slopes (for same-size transistors), the outputs will be weaker at coupling
noise onto adjacent lines. A well-known maxim for high speed design [3] is: given two logic
families with identical maximum propagation delay statistics (50% input-50% output switching points), the family with the slowest output switching time will be cheaper and easier to use. This is because signals with slow rise- and fall-times contain fewer high frequency components, and therefore couple less noise onto other lines. The inverted polarity structure removes buffers and thereby lowers the drive strengths of CSAs and half adders. Therefore, the reduction of high-drive signals may be a secondary advantage of this technique.
However, one should be aware that nodes which are driven with less strength are them-
selves more susceptible to noise injection. This is because the interconnect line which is
being driven is connected to one of the rails (either Vdd or Vgnd) by a greater impedance.
Hence this node is more isolated from the power supplies and is more easily pulled to differ-
ent voltages by adjacently coupled nodes (i.e., bootstrapped.)
Given these two effects, it is unclear whether there is a net advantage or disadvantage as
far as noise immunity is concerned. Ultimately, this may depend on the circuitry surround-
ing a multiplier. If the multiplier has been a source of noise, the inverse polarity technique
can help reduce this. However, if the multiplier has been susceptible to noise from neigh-
bors, an inverse polarity multiplier will be more so.
5.6.4 Further Optimizations

The inverse polarity optimization has been shown to effectively reduce transistor count,
as well as area-of-implementation, resulting in lower capacitance and therefore less power
consumed. The reduction in drive strength, although partly compensated by a lowering of the
logic depth, results in an increase in delay. All these results hold for minimum sized transis-
tors.
There are several issues which should be addressed when implementing the inverse polarity technique. First, a reduction in the delay of multipliers can be obtained by upsizing devices. It remains unclear whether inverse polarity circuits will be beneficial in this case. On the one hand, an inverse polarity stage has low drive, so its delay would benefit immensely from device upsizing. On the other hand, a large number of transistors needs to be upsized (as opposed to the conventional buffered circuit, where only the buffer has to be upsized), so the power-efficiency of inverse polarity circuits may be quickly lost once upsizing takes place. Nevertheless, one can argue that in the extreme case, the inverse polarity multiplier with its smaller logic depth can achieve asymptotically lower delay than the conventional version.
An alternative approach would be to mix inverse polarity adders with conventional adders in multipliers. The logic circuit of a multiplier is like standard combinational logic in that it has a critical delay path, as well as a number of less-than-critical paths. Since conventional adders achieve lower delay, they can be used on the critical path, and inverse polarity adders can be
used solely for off-critical-path elements. Note that this would increase the delay of signal paths that were previously non-critical, so care must be taken to prevent previously non-critical paths from becoming critical. In these cases, the power savings depend on the number of circuit elements which can be replaced by inverse polarity circuits.
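One simple way to realize such a hybrid, sketched below under the assumption that per-adder slacks from static timing analysis are already available, is to substitute an inverse polarity adder only where the slack absorbs its delay penalty. A real pass would re-run timing after substitutions, since each replacement changes downstream slack; the function name and numbers here are illustrative.

```python
def choose_adder_types(slacks, penalty):
    """Pick 'inverse' for adders whose timing slack covers the inverse
    polarity delay penalty, 'conventional' otherwise. Slacks would come
    from static timing analysis; the penalty is the extra stage delay."""
    return ['inverse' if s >= penalty else 'conventional' for s in slacks]

# adders on near-critical paths (small slack) stay conventional;
# off-critical adders (large slack) become inverse polarity
types = choose_adder_types([0.0, 50.0, 300.0, 120.0], penalty=100.0)
```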
Finally, our analysis of multiplier layout used simulated annealing based designs which
employed procedural placement techniques for the final adder. Although annealing has the
advantageous property that it can be applied to general designs without requiring knowledge
of underlying structure, the loss of structural information may lead to sub-optimal results. In
our layouts, we found that on some occasions, components whose interconnection lay on the
critical path were placed far apart, whereas other components which were less critical were
placed close together. It seems that if more structural information could be provided to the
annealing algorithm, better placements could be achieved.
5.7 Summary
We presented a method of reducing transistor count in Wallace tree multipliers through
reducing the number of inverters needed to complete the operation. Results indicate that up
to 25% power reduction can be achieved for minimum-sized multipliers. There is a small
delay penalty associated with this technique, due to reduced drive strengths. Further gains
were achieved in overall interconnect capacitance and die area.
In contrast with latch insertion, we achieved significant power reductions. At a basic level,
this is understandable, since we removed circuitry, while latch insertion adds circuitry. In
order to amortize the additional power dissipation which results from adding circuitry, the
amount of false switching that must be eliminated is quite large. In Wallace trees, we have
seen that the switching activity is much less than in arrays. Therefore, it is not surprising that
the inverse polarity optimization achieved power reduction while latch insertion experiments
failed to produce similar power savings.
6 Conclusions
Power dissipation in multiplier designs has been much-researched in recent years, due to
the importance of the multiplier circuit in a wide variety of microelectronic systems. The
focus of multiplier design has traditionally been delay optimization, although this design
goal has recently been supplemented by power consumption considerations. Our goal has
been first to understand how power is dissipated in multipliers, and secondly to devise ways
to reduce this power consumption.
In this thesis, we described previous work which has been done in the area of multiplier
delay and power optimization. We identified methods by which multiplier delay has been
reduced, and we concentrated on understanding how these various speedup techniques
impact the power dissipation of the multiplier as a whole.
In Chapter 3, we investigated the application of arrays and Wallace trees to partial product
reduction. Conflicting views on the power dissipation characteristics of these two techniques
led us to more closely analyze switching behavior and interconnect loading characteristics.
We devised a simulation environment and a layout model which allowed efficient investiga-
tion of various multiplier configurations, each having distinct delay and power consumption
characteristics. We concluded that while Wallace trees offer a decided delay advantage over array schemes, they also offer a power advantage, as the false switching in array designs outweighs the power consumed by long interconnect in Wallace trees.
We decided to focus on Wallace trees for power reduction, since these have become the
architecture of choice in recent chip designs, due to their better delay properties. False switch-
ing in Wallace trees can be reduced through the use of latches, as described in Chapter 4.
Unfortunately, the power cost of the latches and the circuitry needed to generate the timing
signal tends to overwhelm the power savings generated by reduced switching in the Wallace
tree. Although applications of latches have proven effective in array schemes, we concluded
that Wallace trees do not benefit from latch insertion.
A more useful approach to reducing power in Wallace trees is reducing circuit count
through removal of redundant inverters. This can be achieved using the inverse polarity opti-
mization, which is described in Chapter 5. By keeping track of bit polarities as the Wallace
tree is constructed, we arrive at designs which have reduced circuit count. Power savings of up to 25% were achieved, along with reductions in die area and interconnect. Delay penalties are present, but may be alleviated through a hybrid inverse polarity/conventional construction.
In conclusion, we have presented an investigation of multiplier power dissipation, along
with some techniques which allow reductions in power consumption for this circuit. Given
the importance of multipliers, it is likely that further research efforts will be directed at opti-
mizing this block for delay and power efficiency.
7 Bibliography
[1] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1988, p. 312.
[2] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1980, p. 166.
[3] H.W. Johnson and M. Graham, High Speed Digital Design, Prentice-Hall Inc., New Jersey, 1993, p. 60.
[4] T. Cormen, C. Leiserson, R. Rivest, Introduction to Algorithms, McGraw-Hill Publishers, Massachusetts, 1990.
[5] R. Hitchcock, Sr., "Timing Verification and the Timing Analysis Program," IEEE Design Automation Conference, 1982, pp. 594-604.
[6] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Low Power Digital Design," IEEE Journal of Solid State Circuits, April 1992, Vol. 27, pp. 473-484.
[7] A.P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, R.W. Brodersen, "Optimizing Power Using Transformations," IEEE Transactions on CAD, Jan. 1995, Vol. 14, No. 1, pp. 12-31.
[8] A. Ghosh, S. Devadas, K. Keutzer, J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits," IEEE Design Automation Conference, 1992, pp. 253-259.
[9] F. Najm, "Transition Density: A New Measure of Activity in Digital Circuits," IEEE Transactions on CAD, 1993, Vol. 12, No. 2, pp. 310-323.
[10] R. Burch, F. Najm, P. Yang, and T. Trick, "McPOWER: A Monte Carlo Approach to Power Estimation," IEEE/ACM International Conference on Computer Aided Design, 1992, pp. 90-97.
[11] F. Najm, "A Survey of Power Estimation Techniques in VLSI Circuits," IEEE Transactions on VLSI, 1994, Vol. 2, No. 4, pp. 446-454.
[12] J.M. Rabaey and M. Pedram, Low Power Design Methodologies, Kluwer Academic Publishers, Norwell, Mass., 1996.
[13] G.K. Yeap, Practical Low Power Digital VLSI Design, Kluwer Academic Publishers, Norwell, Mass., 1998.
[14] T. Sakurai, H. Kawaguchi, and T. Kuroda, "Low-Power CMOS Design through Vth Control and Low-Swing Circuits," Proceedings of the 1997 ISLPED, pp. 1-6.
[15] C.H. Tan and J. Allen, "Minimization of Power in VLSI Circuits Using Transistor Sizing, Input Ordering, and Statistical Power Estimation," International Workshop on Low Power Design, 1994, pp. 75-80.
[16] M. Borah, R.M. Owens, and M.J. Irwin, "Transistor Sizing for Low Power CMOS Circuits," IEEE Transactions on CAD, June 1996, Vol. 15, No. 6, pp. 665-671.
[17] K. Usami et al., "Automated Low-power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor," IEEE Custom Integrated Circuits Conference, 1997, pp. 579-586.
[18] R.K. Krishnamurthy and L.R. Carley, "Exploring the Design Space of Mixed Swing Quadrail for Low-Power Digital Circuits," IEEE Transactions on VLSI, Dec. 1997, Vol. 5, No. 4, pp. 388-400.
[19] P. Landman and J. Rabaey, "Architectural Power Analysis: The Dual Bit Type Method," IEEE Transactions on VLSI Systems, June 1995, pp. 173-187.
[20] L.S. Nielsen, J. Sparso, "Designing asynchronous circuits for low power: an IFIR filter bank for a digital hearing aid," Proceedings of the IEEE, Vol. 87, No. 2, pp. 268-281.
[21] S. Devadas, S. Malik, "A survey of optimization techniques targeting low power VLSI," 32nd Design Automation Conference Proceedings, 1995, pp. 242-247.
[22] T.L. Martin, D.P. Siewiorek, "A Power Metric for Mobile Systems," International Symposium on Low Power Electronics and Design, 1996, pp. 37-42.
[23] C.S. Wallace, "Suggestions for a Fast Multiplier," IEEE Trans. Electron. Computers, 1964, EC-13, pp. 14-17.
[24] L. Dadda, "Some Schemes for Parallel Multipliers," Alta Freq., 1965, 34:349-356.
[25] S.D. Pezaris, "A 40-ns 17-bit by 17-bit Array Multiplier," IEEE Transactions on Computers, Vol. C-20, No. 4, pp. 442-447.
[26] W.J. Cody (chairman), "A Proposed Standard for Binary Floating-Point Arithmetic," COMPUTER Magazine, 1981, special reprint.
[27] M. Annaratone and W.Z. Shen, "The Design of an LSI Booth Multiplier," Carnegie Mellon University, Thesis Report, No. CMU-CS-84-150.
[28] M. Santoro and M. Horowitz, "SPIM: A Pipelined 64x64-bit Iterative Multiplier," IEEE Journal of Solid-State Circuits, April 1989, Vol. 24, No. 2, pp. 487-493.
[29] M. Borah, R. Owens, M.J. Irwin, "High-throughput and Low-power DSP Using Clocked-CMOS Circuitry," International Symposium on Low Power Design, 1995, pp. 139-144.
[30] R.K. Montoye, E. Hokenek, S.L. Runyon, "Design of the IBM RISC System/6000 Floating-Point Execution Unit," IBM Journal of Research and Development, January 1990, Vol. 34, No. 1, pp. 59-77.
[31] D. Goldberg, "Computer Arithmetic," in Computer Architecture: A Quantitative Approach, J.L. Hennessy and D.A. Patterson, San Mateo, CA: Morgan Kaufmann Publishers, Inc., 1990, pp. A1-A66.
[32] L.E. Thon, P. Sutardja, F. Lai and G. Coleman, "A 240MHz 8-Tap Programmable FIR Filter for Disk-Drive Read Channels," IEEE International Solid State Circuits Conference, 1995, pp. 82-83.
[33] J. Fadavi-Ardekani, "M x N Booth Encoded Multiplier Generator Using Optimized Wallace Trees," IEEE Transactions on VLSI Systems, June 1993, Vol. 1, No. 2, pp. 120-125.
[34] D. Carlson et al., "A 677MHz RISC Microprocessor Containing a 6.0ns 64b Integer Multiplier," IEEE International Solid State Circuits Conference, 1998, pp. 294-295.
[35] B. Ackland et al., "A Single-Chip 1.6 Billion 16-bit MAC/s Multiprocessor DSP," IEEE Custom Integrated Circuits Conference, 1999, pp. 537-540.
[36] Y. Hagihara et al., "A 2.7ns 0.25µm CMOS 54x54b Multiplier," IEEE International Solid State Circuits Conference, 1998, pp. 296-297.
[37] T.K. Callaway and E.E. Swartzlander Jr., "Low Power Arithmetic Components," in Low Power Design Methodologies, J. Rabaey and M. Pedram, eds., Kluwer Academic Publishers, Norwell, Mass., 1996, pp. 161-198.
[38] A. Bellaouar and M.I. Elmasry, Low-Power Digital VLSI Design, Circuits and Systems, pp. 442-450, Norwell, Mass.: Kluwer Academic Publishers, 1995.
[39] B. Ackland, C.J. Nicol, "High Performance DSPs - What's Hot and What's Not?" International Symposium on Low Power Electronics and Design, 1998, pp. 1-6.
[40] P. Larsson, C.J. Nicol, "Transition Reduction in Carry-Save Adder Trees," International Symposium on Low Power Electronics and Design, 1996, pp. 76-79.
[41] C.J. Nicol, P. Larsson, "Low Power Multiplication for FIR Filtering," International Symposium on Low Power Electronics and Design, 1997, pp. 76-79.
[42] R. Fried, "Minimizing Energy Dissipation in High-Speed Multipliers," International Symposium on Low Power Electronics and Design, 1997, pp. 214-219.
[43] C. Lemmonds and S. Shetti, "A Low Power 16 by 16 Multiplier Using Transition Reduction Circuitry," International Workshop on Low Power Design, 1994, pp. 139-142.
[44] E. de Angel, "Low Power Digital Multiplication," in Application Specific Processors, E.E. Swartzlander, ed., Kluwer Academic Publishers, Norwell, Mass., 1997.
[45] E. Musoll and J. Cortadella, "Low-Power Array Multipliers with Transition-Retaining Barriers," Fifth International Workshop on Power and Timing Modeling, October 1995.
[46] P. Meier, R. Rutenbar, and L.R. Carley, "Exploring Multiplier Architecture and Layout for Low Power," IEEE Custom Integrated Circuits Conference, 1996, pp. 513-516.
[47] R. Rutenbar, "Simulated Annealing Algorithms: An Overview," IEEE Circuits and Devices Magazine, Jan. 1989, pp. 19-26.
[48] M. Sivaraman and A.J. Strojwas, "Towards Incorporating Device Parameter Variations in Timing Analysis," Proceedings of the European Design Conference, 1994, pp. 338-342.
[49] J. Fishburn, "Switch Level Tools," in Algorithms and Techniques for VLSI Layout Synthesis, D. Hill et al., Kluwer Academic Publishers, 1989, pp. 153-179.
[50] J. Fishburn, "A Depth-Decreasing Heuristic for Combinational Logic," 27th ACM/IEEE Design Automation Conference, 1990, pp. 361-364.
[51] HSPICE User's Guide, Meta-Software Inc. (now Avant! Corporation), Fremont, CA, 1992.
[52] Star-Sim User's Guide, Avant! Corporation, Fremont, CA, June 1997.