CARNEGIE MELLON UNIVERSITY
Analysis and Design of Low Power Digital Multipliers
A DISSERTATION SUBMITTED TO THE GRADUATE SCHOOL
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
for the degree of
DOCTOR OF PHILOSOPHY
in
ELECTRICAL AND COMPUTER ENGINEERING
by
Pascal Constantin Hans Meier
Pittsburgh, Pennsylvania
August, 1999
Acknowledgements
This work represents the culmination of several years of hard work exploring and developing ideas in the field of analysis and design of low power digital multipliers. Although this thesis prominently bears my name, it really is the result of the collaborative efforts of a number of people who have discussed, encouraged, cajoled, and otherwise supported me, through successes and failures, to its ultimate publication. It is to these people that this work is dedicated, and I would like to acknowledge their contributions.
My advisors at Carnegie Mellon University, Professors Rob Rutenbar and L. Richard Carley, were my primary guides in this journey. Professor Rutenbar helped me develop a clear understanding of CAD theory and practice through his concise expositions in courses and one-on-one discussion. Professor Carley’s expertise in circuit design greatly broadened my design experience, particularly in the ‘art’ of analog circuit design, a knowledge of which becomes most useful to the digital designer who tries to ‘push the envelope’. The third member of my thesis committee, Professor Larry Pileggi, was especially helpful in questioning my ideas, in particular with respect to technology migration issues.
I was fortunate to have two excellent advisors from industry, Dr. Chris Nicol and Dr. John Fishburn, both from Lucent Technologies. Dr. Nicol’s experience in multiplier power reduction greatly aided in determining promising areas of investigation, and his knowledge of prior work helped me plan the course of this research. Dr. Fishburn’s extensive background in system-level CAD tools provided for a number of interesting and lively discussions on power and delay tendencies of various multiplier designs. For their help I am most grateful.
My fellow graduate students were a constant source of support, and I was greatly enriched by their presence in technical discussions, social events and daily activities: Mehmet Aktuna, Bulent Basaran, Sari Coumeri, Prakash Gopalakrishnan, David Guillou, Rony Kay, Michael Krasnicki, Ram Krishnamurthy, Mark Mescher, Joshua Park, Michele Quarantelli, Nick Zayed. I write this list, knowing that I have forgotten someone—you know who you are. The staff of the Electrical and Computer Engineering Department were equally supportive, especially when the challenges came. I owe a debt of gratitude to my surrogate ‘mom’, Lynn Philbin and to my good friend, Lyz Knight. We made it after all.
The struggle of graduate school and numerous other undertakings would not have reached fruition without the support of many friends in Pittsburgh. In particular, the Catholic community in the Oakland and surrounding areas was a source of constant support, and I am particularly indebted to the Oratorian Fathers of the Ryan Catholic Newman Center/Pittsburgh Oratory for guidance during this time.
Finally I wish to specially thank my parents for all their encouragement, and Peggy Chan, who patiently supported me through this work.
This work was funded by DARPA under contract ADAAL 01-95-K-3527 and the NSF under contracts 942037 and 9408457.
Table of Contents
1 Introduction 1
1.1 Power Reduction in Integrated Circuits 3
1.2 Integrated Circuit Multiplication 6
1.3 Delay and Power in Multipliers 7
1.4 Power vs. Energy 11
1.5 Research Approach 12
1.6 Thesis Outline 15
2 Background: Multipliers and Power Dissipation 19
2.1 Multipliers 19
2.1.1 Multiplier Structure 20
2.1.2 Partial Product Generation 21
2.1.3 Partial Product Reduction 23
2.1.4 Array Style Reduction 26
2.1.5 Wallace Tree Partial Product Reduction 28
2.1.6 Partial Product Reduction/Generation Using Booth Recoding 30
2.1.7 Final Adder 33
2.1.8 Summary 38
2.2 CMOS Power 38
2.2.1 Power Optimization Fundamentals 42
2.3 Multiplier Power Reduction 49
2.3.1 Logic Level Multiplier Optimization 51
2.3.2 Architectural Power Optimization 54
2.3.3 Low Power Multiplication Research 55
2.4 Summary 57
3 Power Trade-offs in Array and Wallace Tree Multipliers 59
3.1 Introduction 59
3.1.1 Partial Product Reduction Schemes 60
3.1.2 Analysis of Switching Behavior 61
3.1.3 Initial Investigations 65
3.2 An Improved Analysis Methodology 66
3.2.1 Generation of Component Library 67
3.2.2 Circuitry 68
3.2.3 Cell Characterization 69
3.2.4 Partial Product Reduction Generators 71
3.2.5 Adder Generators 72
3.2.6 Layout Model and Wallace Tree Placement 72
3.2.7 Power and Delay Estimation 74
3.3 Experimental Results 75
3.3.1 Layout Characteristics 76
3.3.2 Energy Per Operation 76
3.3.3 Delay 77
3.3.4 Further Modelling Refinements 80
3.4 Summary 81
4 Minimizing Switching Activity By Latch Insertion 83
4.1 Introduction 83
4.2 False Switching in Multipliers 84
4.2.1 Input Latching 85
4.2.2 Latching the Signal Path 87
4.2.3 Previous Work 89
4.2.4 General Principles of TRB Insertion 91
4.3 Latches as Transition Retaining Barriers in Wallace Trees 95
4.3.1 Placement of Latches 95
4.3.2 Wallace Tree Latch Placement 97
4.3.3 Latch Insertion Methodology 98
4.3.4 Placement of Latches on the Wallace Tree/Final Adder Boundary 101
4.3.5 Placing Latches in Wallace Trees 104
4.4 Experiments 108
4.4.1 Experiment: Latch Placement on the Wallace Tree/Final Adder Boundary 109
4.4.2 Experiment: Placing Latches in the Wallace Tree 111
4.4.3 Conclusions 112
4.5 Conclusions 114
5 Minimizing Power Via Inverse Polarity Optimization 117
5.1 Introduction 117
5.2 Polarity Inversion 118
5.3 Design Issues for Polarity Inversion 121
5.3.1 Adder Circuit Designs 122
5.3.2 Partial Product Reduction 125
5.3.3 Inverse Polarity Wallace Tree Algorithm 127
5.3.4 Physical Design Issues 130
5.4 Experimental Results using Layout Estimation 133
5.5 Experiments with Detailed Layout 137
5.5.1 Enhanced Methodology 137
5.5.2 Placement Details 141
5.5.3 Interconnect Details 143
5.5.4 Aspect Ratio Details 148
5.6 Additional Design Considerations 152
5.6.1 Wiring Delay 153
5.6.2 ‘Logic-based’ Delay 154
5.6.3 Noise Issues 156
5.6.4 Further Optimizations 157
5.7 Summary 159
6 Conclusions 161
7 Bibliography 165
List of Figures
Figure 1. Digital multiplication flow. 21
Figure 2. Partial product generation for 6-bit by 6-bit multiplication. 22
Figure 3. Partial product addition. (a) Full adder cell. (b) Basic ripple adder. (c) We can use ripple addition to add all the shifted copies of the multiplicand. (d) Since there are n-1 ripple adders, each of width n, this basic method takes O(n²). 25
Figure 4. Array partial product reduction- (a) the initial partial product (b) using a row of carry-save full adders to reduce 3 bit vectors down to two (c) resulting PPA (d) full array structure. 27
Figure 5. Wallace tree partial product reduction. (a) partial product array (b) parallel carry-save addition (c) resulting PPA (d) complete Wallace tree structure. 29
Figure 6. Booth recoding (radix 4). 32
Figure 7. Ripple adder 34
Figure 8. Carry Skip Adders: (a) One carry-skip block (b) 12-bit adder. 35
Figure 9. Carry Select Adders: (a) Basic structure of adder, (b) delay chart, and (c) construction for minimum delay. 36
Figure 10. Equation for CMOS power consumption 40
Figure 11. Short circuit current occurs when CMOS devices switch. If the input of the gate is in the region where PMOS and NMOS devices are both on, current will flow directly between the rails. Short circuit current can occur if the gate is not held close to the Vdd or Vgnd rails. 41
Figure 12. Biasing for Vt modification via backgate effect 43
Figure 13. Pulldown stack of 3-input NAND 46
Figure 14. Switching behavior for different input arrival times 47
Figure 15. Output glitching. 48
Figure 16. Sequential vs. parallel layouts for logic blocks. 62
Figure 17. Signal flow in arrays vs. Wallace trees. (a) In arrays, inputs are present at every level of logic depth, so digital circuits at deeper logic levels experience more switching. However, the carry-save adders are arranged in rows, so signals tend to flow in "waves" down the logic. (b) Wallace trees have inputs at one logic level, so input data arrives in parallel and flows downward. However, some connections "skip" a logic level and so input arrival times tend to be skewed at deeper logic levels. 64
Figure 18. Transition count comparison for multipliers. 65
Figure 19. Full adder, also called carry-save adder, implemented using the ’28T’ construction. 68
Figure 20. Multiplier layout model. 73
Figure 21. Estimated average energy per multiply op. 77
Figure 22. Estimated worst-case multiplier delay (ns). 79
Figure 23. Forms of glitch-inducing delay 85
Figure 24. Using latches to re-time the generation of partial products (Booth). 86
Figure 25. Using latches to equalize signal arrival times in the signal path (Transition Retaining Barriers.) 88
Figure 26. The false switching in an array is linear in terms of the logic depth. a) For a 16 bit multiplier, the false switching at logic depth 16 is approximately 8 toggles/operation. b) Inserting a TRB saves some of the switching (gray box.) c) The TRB should not be inserted too early or d) too late, as this lessens the amount of false switching that is eliminated. 90
Figure 27. Incorporating latching behavior into an inverter. (a) Inverter. (b) C2MOS version of inverter. (c) Incorporating state preservation using a pass gate. 92
Figure 28. Triggering of latches. A chain of inverters may be used to generate the delay signal. If all latches are driven in parallel a) the final signals should be buffered. Otherwise, b) the delay chain can be used unbuffered, assuming the load is not so great. 94
Figure 29. Width of signal path: (a) In arrays, the width of the signal path is constant at all logic depths, but in Wallace trees (b), the width of the signal path is greater at shallower logic depths, but width decreases as the logic depth increases. 96
Figure 30. Procedure for latch insertion: 1) create a chain of inverters 2) determine placement of latches 3) set timing of latches 4) trim inverter chain. 98
Figure 31. Potential latch insertion sites: a) at the Wallace tree/final adderboundary and b) in the Wallace tree. 100
Figure 32. "Cascade" triggering style - in this method, the triggering signals of latches on the inputs of an adder are incrementally delayed, so the inputs are delayed to compensate for the delay of the carry. 103
Figure 33. Placing latches by logic depth. a), b) calculating logic depth, c) placing latches "one up from the bottom." 105
Figure 34. Placement of latches in Wallace tree. a) Original structure, and b) placement of latches, "one level" up from the PP reduction/adder interface. 106
Figure 35. Placement of latches on Wallace tree/Final adder boundary 109
Figure 36. Placement of latches within Wallace tree. 111
Figure 37. Inverse polarity optimization (a) Conventional ripple adder. (b) Inverted polarity version (c) Multiplier PPA structure (array). 120
Figure 38. The 28T full adder (CSA) implementation. 123
Figure 39. Conventional implementation vs. inverse polarity equivalence. (a) For full adders, CSA(a', b', c') = (CSA(a, b, c))'. (b) But for half adders, HA(a', b') != (HA(a, b))'. 124
Figure 40. HAIP designs (a) POS inputs, NEG outputs, (b) NEG inputs, POS outputs 125
Figure 41. Fadavi-Ardekani algorithm (a) Trapezoidal PP array — all bits in column n to be added, plus carry-in bits from column n-1. (b) Bits are put in a priority queue (one queue for each column). FA's are applied to earliest arriving bits, yielding (c) two bit vectors to be added by the final adder. 126
Figure 42. Inverse polarity - (a) arrays have regular alternating structure. (b) Wallace trees have connections which “skip” a level, causing polarity conflicts at subsequent adder inputs. 127
Figure 43. Basic inverted polarity CSA tiling algorithm. 129
Figure 44. Procedural assembly of final adder (a) Calculate width of final adder, (b) based on width of placement area, (c) determine number of rows needed, and trowel in the final adder. 132
Figure 45. Average short circuit currents (a) full adders have lower short circuit currents than (b) inverters. 135
Figure 46. Physical design verification methodology. 139
Figure 47. Placement for (a) 4-bit conventional multiplier and (b) 8-bit inverse polarity multiplier. The final adder is shown in dotted lines. 142
Figure 48. Routing of 8-bit multipliers - (top) Conventional 8-bit multiplier and (bottom) inverse polarity 8-bit multiplier. 144
Figure 49. Estimated versus extracted net capacitances for an 8-bit inverse polarity multiplier 146
Figure 50. Physical design of Wallace tree multiplier (a) logical structure, (b) idealized physical layout. 149
Figure 51. Routing of 8-bit multipliers, aspect ratio ~1 - (a) Conventional 8-bit multiplier and (b) inverse polarity 8-bit multiplier. 151
Figure 52. Inverse polarity loading effects (a) Use of full adder in PP reduction. (b) Circuit implementation of "(a)" using conventional full adder. (c) Implementation using inverse polarity full adder—note that load on carry output affects delay of sum output. 155
List of Tables
Table 1: Booth encoding 31
Table 2: Asymptotic time and space characteristics 38
Table 3: Array multipliers - estimated area (mm²) 76
Table 4: Wallace tree multipliers - estimated area (mm²) 76
Table 5: 8 bit multiplier - estimated delay (ns) 78
Table 6: Count of inverters in multipliers 134
Table 7: Energy / operation for 8 bit multipliers 134
Table 8: Energy / operation for 16 bit multipliers 134
Table 9: Delay, area, wire cap. for 16 bit multipliers 136
Table 10: Simulated and extracted data 145
Table 11: Aspect ratio and capacitance 152
1 Introduction
With the increasing level of device integration and the growth in complexity of micro-
electronic circuits, reduction of power dissipation has come to the fore as a primary design
goal. While power efficiency has always been desirable in electronic circuits, only recently
has it become a limiting factor for a broad range of applications, requiring consideration
early on in the design process.
Power dissipation limitations come in two flavors. The first is related to cooling consid-
erations when implementing high performance systems. High speed circuits dissipate large
amounts of energy in a short amount of time, generating a great deal of heat as a by-product.
This heat needs to be removed by the package on which integrated circuits are mounted.
Heat removal may become a limiting factor if the package (PC board, system enclosure,
heat sink) cannot adequately dissipate this heat, or if the required thermal components are too
expensive for the application.
The second failure mode of high-power circuits relates to the increasing popularity of por-
table electronic devices. Laptop computers, pagers, portable video players and cellular phones
all use batteries as a power source, which by their nature provide a limited time of operation
before they require recharging. To extend battery life, low power operation is desirable in inte-
grated circuits. Furthermore, successive generations of applications often require more com-
puting power, placing greater demands on energy storage elements within the system.
Technology improvements in the last few decades have succeeded in reducing power con-
sumption. Trends such as using CMOS instead of bipolar devices, and reduction in feature
size of lithographic processes have served to reduce power dissipation, although other objec-
tives, namely high integration and speed, were the primary goals of such improvements. It is
only in recent times that power optimization has been viewed as a design goal in its own right.
This work will look at power optimization applied to a circuit element which is pervasive
in modern microelectronic circuit design, the digital multiplier. In this chapter, we will
describe power optimization in general terms, especially how it fits in to current design flows.
We will briefly describe the multiplier and how it is used for different applications, along with
some design goals for building suitable multipliers. Finally, we give an overview of the issues
investigated in this work, and describe some of the results.
1.1 Power Reduction in Integrated Circuits
Although the above discussion has motivated the ascendancy of power to the attention
of the designer, it is important to understand the place that power has relative to other objec-
tives. After guaranteeing correct digital functionality, the primary consideration for system
designers has always been, and continues to be, speed. A circuit is specified to operate at a
particular delay; otherwise, the entire system does not work. Further reduction in the delay is
beneficial but not strictly necessary. The power dissipation characteristics of a system, on
the other hand, are often a consequence of the delay specification. Once the delay of the sys-
tem is achieved, package cooling and/or battery resources will be allocated appropriately
(within reason). Other factors may have equal or greater importance than power dissipation:
area of implementation and yield/reliability issues are subjects which the designer must take
into account. Nevertheless, the increase in power dissipation in successive generations of
integrated circuit technologies has progressed at such a pace that it is now one of the pri-
mary considerations for designers.
A major complication in microelectronic circuits is the fact that many design decisions
involve a power-delay trade-off; one cannot be lowered without raising the other. It is
important to note, however, that power reduction techniques are not necessarily negatively correlated to delay reduction. For example, one method to reduce delay in a circuit’s critical path is to upsize the driving strength of gates, which also results in increased power dissipation. However, another way to lower delay might be to reduce interconnect capacitance, which reduces both power and delay. Generally, great power savings can be achieved if delay is not an issue, but optimizing power without considering delay is trivial.

Power reduction techniques are applied at all levels of the design hierarchy. At the most basic level, the parameters of the lithographic process in which the integrated circuit is manufactured may be modified. Doping concentrations, minimum geometrical spacings of structures, etc., all affect power dissipation. Many of these parameters are beyond the control of the circuit designer. Some of these ‘global’ constraints, however, may be modified at various stages of the design methodology, although they are by and large left alone. For example, lowering of Vdd is a well-accepted method of reducing chip power, and there has been a constant trend towards running even the most high-performance microprocessors at lower supply voltages, despite the delay penalty that low supply voltage incurs. (Lowering Vdd goes hand-in-hand with other scaling schemes, such as gate oxide reduction, line-width scaling, etc. Voltage reduction may be seen as a consequence of scaling, and the attendant power reduction is a serendipitous result.) Other schemes, such as using multiple supply voltages (high voltage rails for
high speed sections, low voltages otherwise) have been proposed, although their use to date has not been large.

Power reduction techniques applied to the design hierarchy for digital CMOS devices can be subdivided as follows:

• Chip/floorplan level - At this point, the power characteristics for the entire die are planned and power/delay ‘budgets’ are allocated. Vdd and Vth are determined based on performance goals. Power due to the assembly of macroblocks (interconnect and clock nets) is optimized, subject to delay and area/routability constraints. As macroblocks are instantiated, this high level information is updated and budgets may be adjusted to increase/decrease specifications on other sub-blocks.

• Macroblock level - This stage comprises the assembly of gates into a basic function, such as a block of control logic, or an arithmetic unit. In a standard cell design methodology, it is at this point that the implementer’s intellectual property enters the design flow. As such, the various methods of implementing a particular function impact both power and delay. The power optimization techniques available at this level are numerous.

• Gate/circuit level - This is the lowest level stage visible to the designer, where transistors are assembled into basic gates. In a standard cell methodology, the fundamental gates
used in macroblock assembly are defined here. Power optimizations are very restricted at
this level, as delay specification often dictates the power at which a device will operate.
Emphasis is placed on reducing device parasitics and area while maximizing routability. A
full custom approach may succeed in achieving lower power than a standard cell
approach, as individual transistor sizing may overcome certain obvious inefficiencies.
More aggressive circuit families, such as dynamic logic, dual rail logic, or low-swing
logic can be implemented at this stage, but these often require complicated design flows
(e.g., through the introduction of clocks, noise sensitive nets, the need for level conver-
sion, etc.)
In our work, we focus on a standard cell CMOS methodology, as this is the most common
method for quickly and efficiently assembling a digital integrated circuit. As such, a number
of clever and useful circuit techniques cannot be applied due to the complications which their
use introduces in the design flow. Therefore, the performance of designs achievable by a
standard cell based implementation is sub-optimal, but the ease of implementation of such a
flow allows exploration of a wide range of designs.
1.2 Integrated Circuit Multiplication
Digital multiplication is one of the most basic functions in a wide range of algorithms[4].
The ubiquity of this operation in computing has given rise to a large number of multiplier
implementations, each with different specifications and goals. Some applications require
wide dynamic range, others need high precision, while in some cases, neither of these char-
acteristics are very tightly specified. Digital multiplication is used as opposed to analog
when high precision is an issue; it is fairly straightforward to make digital multipliers as
accurate as the application requires. Precision required for multiplication varies by function.
At the low end, 8 bits are needed, e.g., in image compression algorithms, or 16 bits in more
precise DSP tasks. At the high end, we see 53 bit and 64 bit multiplication (IEEE double
precision standard[26].) Typically, we see 16 bit multipliers used for digital signal process-
ing and 53/64 bit multipliers used in microprocessors.
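The integer multiplication these designs implement reduces to summing shifted copies of the multiplicand, one per multiplier bit. The following Python sketch is an illustrative model of that arithmetic only, not of any hardware structure in this thesis; the function name and the bit-width check are ours:

```python
def multiply(a, b, n):
    """Multiply two n-bit unsigned integers by summing shifted partial products."""
    assert 0 <= a < (1 << n) and 0 <= b < (1 << n)
    result = 0
    for i in range(n):
        if (b >> i) & 1:          # bit i of the multiplier selects a partial product
            result += a << i      # the multiplicand, shifted left by i positions
    return result
```

The hardware structures surveyed in Chapter 2 (arrays, Wallace trees, Booth recoding) are different ways of generating and summing exactly these shifted partial products.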
The basic operation in these designs is integer multiplication; in floating point multipli-
ers, integer multiplication units are sub-blocks of the greater floating point unit. Signed ver-
sus unsigned techniques have an impact on the design, and some clever techniques have
been suggested for manipulating the bit representation of numbers to generate power sav-
ings. However, the primary consideration in multipliers has been and continues to be delay.
1.3 Delay and Power in Multipliers
This thesis focuses on multiplication performed in digital signal processing (DSP) algorithms. Multiplications in this regime typically require a precision of 8 or 16 bits. From a delay perspective, algorithms place two constraints on multiplication: latency and throughput. Latency is the real delay of computing a function, a measure of how long after the inputs to a device are stable the final result is available on the outputs. Throughput is a measure of how many multiplications can be performed in a given amount of time. For a simple combinational multiplier, throughput is a function of latency. However, various techniques exist which can compute several multiplications in parallel, e.g., through pipelining; in these cases, latency is only loosely correlated to throughput.

For digital signal processing, throughput is a major concern. DSP algorithms are often related to perception of audio/visual stimuli, for example, image or voice transmission and recognition. In these tasks, precision requirements are less stringent than for other applications (e.g., numerical algorithms for scientific computing) so small bit-width multipliers may be used—latency is a function of bit width, and small multipliers do not create long delay paths. For many DSP applications, the relevant limiting specification is throughput. These tasks often require fairly coarse resolution of images but operate on a large amount of data representing different image or sound samples. For example, image rendering requires performing computations on a large number of polygons, whereas the precision involved (bits required for identifying a particular color, bits required for identifying spatial coordinates) is
fairly small. Voice compression requires a large number of 8 bit calculations. One way of processing this large number of computations quickly can be achieved by lowering the latency of the multiplication; in this manner, the multiplier can start performing the next operation sooner.

A more efficient method to increase the number of computations is to increase the throughput. Various schemes are possible; for example, pipelining/interleaving of data allows one functional unit to compute several operations concurrently, while implementing multiple devices on one chip simply increases the throughput by the number of additional units. These techniques tend to be more efficient than latency reduction, because if one tries to lower the delay of a circuit, diminishing returns are quickly encountered (if a circuit’s transistors are upsized, at a certain point the delay does not decrease further.) When optimizing throughput, on the other hand, for each additional functional block that is added, the number of operations which may be computed in a given amount of time increases by one.

Although pipelining is good for throughput, it may be hard to implement in tightly coupled hardware/software systems. While the logic implementation of pipelining is fairly straightforward, getting compilers to build programs to take advantage of pipelining is often difficult. The problem lies in setting up a series of operations to begin execution while other operations have yet to finish. These ‘parallel’ or ‘multiple-issue’ modes of system behavior
have timing dependencies which complicate the task of writing a compiler that can take
advantage of such hardware techniques. Moreover, while some DSP algorithms lend them-
selves to parallel operation, others require processing to be more sequential in order, rendering
the additional pipeline hardware useless. Finally, extremely high speed code is often imple-
mented by hand in assembly language. Understanding all the methods of optimizing a pipe-
lined function can be very tedious if done manually. These reasons argue for using multiple
multiplication functional units on one chip, as opposed to implementing a heavily pipelined
multiplier[35].
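The latency/throughput distinction above can be summarized in a toy model (all numbers and function names here are hypothetical, chosen only for illustration):

```python
def throughput_single(latency_ns):
    # A combinational multiplier completes one result per latency period.
    return 1.0 / latency_ns            # operations per ns

def throughput_parallel(latency_ns, units):
    # k identical units working concurrently multiply throughput by k.
    return units / latency_ns

def throughput_pipelined(stage_delay_ns):
    # Once full, a pipeline delivers one result per stage delay,
    # even though each individual result still takes the full latency.
    return 1.0 / stage_delay_ns
```

For example, doubling the number of units doubles throughput directly, whereas halving latency by transistor upsizing runs into the diminishing returns described above.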
The multiplier is a fairly large block of a computing system. The amount of circuitry
involved is proportional to the square of its resolution (i.e., a multiplier of size n bits has
O(n²) gates.) Not only is the multiplier a high delay block, but it can be a significant source of
power dissipation. Based on the argument delineated above, that several multipliers should be
present on-chip as more DSP compute power is needed, the power dissipation involved in
multiplication will become more dominant. (Even if the pipelined approach is used, to a first
order, the pipelined multiplier will dissipate as much power as several multiplier blocks.
Although a pipelined version has fewer gates, it will still experience roughly the same amount
of switching.) Therefore, digital multipliers have become one of the prime circuits targeted for
power reduction [39][41][42].
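The quadratic growth claimed above can be sketched with a simple counting model; the cell counts are rough approximations for a plain array multiplier, assumed here only for illustration:

```python
def partial_product_bits(n):
    # One AND gate per (multiplicand bit, multiplier bit) pair.
    return n * n

def array_csa_cells(n):
    # Rough count for a simple array: (n - 1) rows of n carry-save adders.
    return n * (n - 1)

for n in (8, 16, 32):
    print(n, partial_product_bits(n), array_csa_cells(n))
```

Doubling the operand width thus quadruples the partial-product hardware, which is why multiplier power grows so quickly with precision.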
1.4 Power vs. Energy

The distinction between the terms power and energy is important to this discussion. Note that energy is a measure of the total number of Joules dissipated by a circuit, whereas power refers to the number of Joules dissipated over a certain amount of time. Properly speaking, power reduction is a different goal than energy reduction.

Power can be a problem primarily when heat removal is a concern. If too many Joules of energy are converted into heat within a short amount of time, a package’s heat sink may not be able to redistribute this heat quickly enough; the result will be a rise in temperature and subsequent thermal failure. If the same amount of energy is generated over a longer amount of time, power dissipation is reduced and the cooling structure can better deal with the thermal demands of the circuit. Here, power reduction consists of reducing the case when a large amount of energy is dissipated in a short amount of time. Again, the total energy dissipated may remain the same.

Energy reduction consists of reducing the total number of Joules to be dissipated. Therefore, we often speak of energy per operation as the metric to be optimized; time is not a factor in this calculation. Power reduction, then, belongs to the domain of thermal reliability, whereas energy reduction lies in the domain of maximizing battery lifetime.1
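A small worked example makes the distinction concrete; the 50 pJ/operation figure and the operation rates are assumed numbers, chosen only for illustration:

```python
# Energy vs. power: the same circuit, run at two different rates.
ENERGY_PER_OP_J = 50e-12   # Joules per multiply (hypothetical value)

def avg_power_w(ops_per_sec):
    # Power is energy dissipated per unit time.
    return ENERGY_PER_OP_J * ops_per_sec

fast = avg_power_w(200e6)   # 200 Mops/s -> 0.010 W
slow = avg_power_w(50e6)    #  50 Mops/s -> 0.0025 W
```

Running four times slower cuts the thermal load (power) by four, yet the energy drawn from a battery per operation, the metric governing battery lifetime, is unchanged.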
In digital CMOS, one often hears of a power-delay trade-off, or of a circuit operating at
Analysis and Design of Low Power Multipliers 11
or
g
ly,
tems
.
ll as
ed
net
h-
a point in the power*delay space. This power*delay is continually being improved, (e.g.,
using more advanced processes or better logic designs.) In a sense, this is a misnomer as
power*delay = (energy/delay)*delay = energy; this implies one should minimize energy—
more importantly, minimizing delay is irrelevant. Instead, one should speak of minimizin
energy*delay, as this metric involves two independent measures of circuit behavior.
The literature consistently refers to minimizing power, whereas the techniques described
in almost all cases involve minimizing energy. The two terms tend to be used interchangeab
with ‘power’ being used where ‘energy’ should be used instead. This usage most likely s
from this field of research being known as low-power circuit operation. We will retain the use
of ‘power’ because of its standard usage but will use ‘energy’ where clarity is warranted
1.5 Research Approach
In this work, we wish to identify the relevant components which impact power as well as delay of multipliers. As mentioned above, power as a figure of merit should be considered concurrently with delay; a power reduction with a proportional delay increase achieves no net advantage.2 In the first phase of our research, we focused on existing delay reduction tech-
1. Note that power reduction is also relevant to battery operation. Various types of battery differ in their ability to provide bursts of high current [22].
2. All things being equal. We will not consider the argument of Chandrakasan et al. [6], where voltage scaling and functional unit parallelism are exploited to achieve a power savings.
niques for multipliers and looked at their power dissipation properties. The field of delay
reduction for digital multipliers is a well developed one which spans at least three decades.
Delay targets can be met using different delay optimizations, and in some cases, multiple
methods may be used concurrently. It is of great interest to identify which techniques should
be applied to reduce delay if one also wishes to minimize power consumption. Those
approaches which show the most promise, as well as those which have not been previously
explored, are naturally the subject of our most intense focus. In the second phase of this
work, we attempted to expand the repertoire of possible techniques by investigating ideas
for power reduction which suggested themselves from the initial analysis. Not all the ideas
which we implemented were successful. Nevertheless, we did identify a few techniques
which show promise for power reduction in multipliers.
Our experimental procedure was based on the standard cell design methodology which
is common in industry today. We started from a small library of basic CMOS cells necessary
to implement our functions. These were extensively characterized for energy dissipation and
delay under a wide range of input slopes and output loads. The multiplier logic was derived
from standard descriptions existing in the literature, and the actual logic instantiation was
performed automatically using assembly algorithms derived from the literature, along with
some optimizations which we contributed to the process. Once the multiplier had been
assembled, the contribution of physical effects was estimated using a placement and routing stage, which attempted to determine rough wiring characteristics based on routing length. In later experiments, results were verified by performing a full layout using the Cadence tool suite.
Given the logic and physical description of the function, timing analysis was performed using static delay information from the characterized standard cells. Power was determined using a logic simulator which counted switching events and estimated glitches. Additional power numbers were derived from runs using HSPICE and Star-sim from Avant! Corporation [51][52].
Our design methodology evolved over time, as our tools were refined and accuracy was improved. Final values were checked using vendor tools where appropriate. Two goals motivated the design of this CAD methodology. First, we wanted to perform certain tasks which were not possible using vendor tools; some of these, such as multiplier assembly, were clearly necessary as this represents the contribution of this thesis. Second, while vendor tools performed some of the tasks necessary for this work (e.g., Synopsys Design Compiler could have been used for static timing analysis), we required several slight modifications to vendor tool functionality (e.g., timing analysis computing the static longest path) which were difficult to achieve. Although coding such procedures increased the time necessary to complete this
project, we also received the benefit of having access to these procedures at a very fine level of detail (for example, we were able to incorporate timing code in the inner loop of our multiplier assembly algorithm).
1.6 Thesis Outline
This thesis can be divided into three separate projects, which investigated different aspects of multiplier power dissipation. The overall thesis is organized as follows.
• Background: multipliers and power dissipation - (chapter 2): we first describe the domain of our investigation, power dissipation in multipliers. We present a description of digital multipliers and their component parts, along with optimizations that have been proposed to speed up the operation. Furthermore, we describe the state of the art in power reduction for multipliers. This is followed by an analysis of power dissipation in static CMOS digital logic, the standard form used to implement logic functionality. Some basic techniques for CMOS power reduction are briefly described.
• Arrays vs. Wallace trees applied to partial product reduction - (chapter 3): our initial work investigated two approaches to multiplier partial product reduction, focusing on the power dissipation characteristics of these two. Prior to our investigation, we had found contradictory data, suggesting that each of these two forms was more power effi-
cient than the other [37][38]. By considering switching activity and parasitic capacitance due to interconnect, we were able to demonstrate which of the two designs offers the most promise for low power operation.
• Transition retaining barriers - (chapter 4): we developed an idea for transparent latches to “re-synchronize” signals propagating through partial product reduction techniques. Our goal was to compare the potential for this technique in both arrays and Wallace trees. The analysis for arrays exists in the literature [45], so we attempted to apply this method to Wallace trees. Unfortunately, the characteristic behavior of signal propagation in Wallace trees makes this approach unsuitable, and power reductions were not obtained in our early analysis.
• Inverse-polarity techniques - (chapter 5): this approach investigated the potential of removing inverters in adder blocks within multipliers. The technique comes from ripple adders, which use this optimization to reduce logic depth as a delay optimization. We adapted this technique to attempt transistor count reduction, so as to reduce power dissipation. Our experiments show that power reductions of up to 25% can be achieved using this technique. Various characteristics of the logic function, such as delay, area, and noise properties are impacted when using this method and are further described. Furthermore, we attempted to verify layout estimates by performing full physical design on an
advanced process. Our analysis shows that our estimates were very close to data
obtained through complete design of the multipliers.
We conclude the thesis with some general observations about power reduction for multipliers. There are several unexplored prospects for power reduction which warrant further investigation, including a few suggested by the results described in this thesis; we discuss these in this last section.
2 Background: Multipliers and Power Dissipation
In this chapter, we discuss some of the basic concepts which are common to the areas
which we investigated. First, we present a brief description of digital multipliers, including
their structure and relevant components. Delay reduction techniques are also discussed.
Next, we review power dissipation in CMOS circuits, along with some basic techniques
which can be applied to reduce power.
2.1 Multipliers
In order to understand delay and power trade-offs in multipliers, we will describe the
basic circuit structure of digital multiplier implementations in more detail. We wish to
examine some of the techniques which have been developed to reduce multiplier delay, par-
ticularly to gain an understanding of their characteristic power dissipation.
While some insight can be gained through direct observation of logic structure, power dis-
sipation comes from several sources; techniques which reduce the power due to one of these
sources can worsen the power dissipation due to another. A brief discussion of sources of
power dissipation in digital CMOS will illustrate the relevant effects. To incorporate all the
relevant power dissipating effects into our analysis, we chose to evaluate multiplier designs by
developing a methodology which uses low-level circuit simulators to calculate power con-
sumption. To compensate for the slowness of a detailed circuit-based approach, a simulation-
based evaluation program allowed quick analysis of power and delay for these designs, and
accuracy was verified using the detailed circuit simulators.
2.1.1 Multiplier Structure
At the most basic level, digital multiplication can be seen as a series of bit shifts and bit additions, where two numbers, the multiplier and the multiplicand, are combined into the final
result. Consider the multiplication of two numbers: the multiplier P, and multiplicand C,
where P is an n-bit number with bit representation {pn-1, pn-2, ..., p0}, the most significant bit being pn-1 and the least significant bit being p0; C has a similar bit representation {cn-1, cn-2, ..., c0}. For unsigned multiplication, up to n shifted copies of the multiplicand are added to
form the result. The entire procedure is divided into three steps: partial product (PP) genera-
tion, partial product reduction, and final addition. This is illustrated conceptually in Fig. 1.
2.1.2 Partial Product Generation
The initial step in digital multiplication is to generate n shifted copies of the multiplicand which may then be added in the next stage. Whether a given shifted copy of the multiplicand is added depends on the value of the multiplier bit corresponding to this multiplicand copy. If the ith bit (i = 0 to n-1) of the multiplier is ‘1’, then the multiplicand is added. If the bit is ‘0’, the multiplicand is not added.
The bit representation of this function can be implemented using a logical AND gate,
Figure 1. Digital multiplication flow: the n-bit input operands A and B (n = 4 in the figure) undergo partial product array generation (n shifted binary numbers), partial product array reduction (down to 2 binary numbers), and final addition (the 2n-bit final product).
which performs AND(ci,pj), i = 0 to n-1, j = 0 to n-1. The resulting values are called partial
product bits or simply, partial products; if we line up these partial products by bit position
(bp = i + j), we arrive at a structure shown in Fig. 2. In this design, the partial product bits are
arranged in columns, which are to be added together to form the final value. The resulting
trapezoidal structure is called a partial product array or simply PPA. This step is called partial
product array generation.
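The AND-based generation step above can be sketched in a few lines; the function name and integer encoding below are ours, not the thesis's:

```python
# Sketch of unsigned partial product generation: row i is the multiplicand
# masked by multiplier bit p_i (the AND gates) and shifted left by i. The
# function name and encoding are ours, not the thesis's.

def partial_products(p, c, n):
    """Return n integer-encoded rows for n-bit multiplier p, multiplicand c."""
    rows = []
    for i in range(n):
        p_i = (p >> i) & 1                    # i-th multiplier bit
        rows.append((c if p_i else 0) << i)   # AND row, aligned to position i
    return rows

# Adding the rows column by column reproduces ordinary multiplication:
for p in range(16):
    for c in range(16):
        assert sum(partial_products(p, c, 4)) == p * c
```

The remaining stages of the multiplier exist only to perform this summation quickly.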
Various forms of partial product arrays exist, depending on the number representation. For
example, in the following section we describe a common variant called Booth recoding,
which allows a signed multi-bit partial product representation of the design. Common variants
Figure 2. Partial product generation for 6-bit by 6-bit multiplication: each partial product bit ppij is formed from multiplicand bits c0 through c5 and multiplier bits p0 through p5, and the bits line up into columns to be added.
for efficiently implementing two's complement are described in [27].
There are several important points to notice about the partial product array. First, in the most basic formulation (PPA bits generated via logical AND), all the bits are created in parallel; that is, the static delay of each of the bits is equal. Second, the dimensions of the array are functions of the size of the multiplier and multiplicand: the height of the array is proportional to the size of the multiplier, and the width is proportional to the size of the multiplicand. Finally, all the bits in a particular column are to be added together, and some columns have fewer bits than others. Specifically, low-order and high-order bit positions will require fewer additions than the middle bit positions; slightly more additions will be required at the high-order positions than at the low-order ones, as carries from low order positions result in a larger number of bits to be added at the high order bit positions.
2.1.3 Partial Product Reduction
The heart of an efficient digital multiplier implementation is in the manner in which the PPA bits are added. Were conventional carry adders used to implement these add operations, the delay of all the adders would consume a large amount of time, as each shifted version of the multiplicand would contribute a delay which is proportional to the width of the multiplicand. Instead, the partial product is reduced using a technique called carry-save addition [31]. This procedure allows successive additions to be incorporated into one global
addition step.
Consider a numerical bit vector representation of the following form: (bn-1, bn-2, ..., b1, b0), bi ∈ {0,1}, 0 ≤ i < n. If we wish to add two bits from two bit vectors, say a0 and b0, from bit
vectors a and b, we can use the full adder (Fig. 3a); it takes in three bits and produces a sum
bit, and a carry bit. When adding two vectors together, this block can be used to add two bits
at a given bit position with the carry-in from the previous bit position. Consider the case
where two bit vectors are to be added. At the lowest bit position, two bits are added, and the
carry is propagated to the next bit position. From then on the carry-in and the next two bits at
the higher bit position are combined, and a carry-out is generated. Using this rippling tech-
nique, we see that adding two n-bit numbers takes O(n) sequential bit additions, resulting in a
delay of O(n).
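The rippling scheme just described can be sketched as follows; the helper names and little-endian encoding are ours, not the thesis's:

```python
# Sketch of bit-vector ripple addition: one full adder per bit position,
# with the carry chained sequentially (hence O(n) delay). Names are ours.

def full_adder(a, b, cin):
    return a ^ b ^ cin, (a & b) | (a & cin) | (b & cin)   # (sum, carry)

def ripple_add(a_bits, b_bits):
    out, c = [], 0
    for a, b in zip(a_bits, b_bits):          # n sequential bit additions
        s, c = full_adder(a, b, c)
        out.append(s)
    return out, c                             # sum bits and final carry-out

a, b = [1, 1, 0, 1], [1, 0, 1, 1]             # 11 and 13, LSB first
s, cout = ripple_add(a, b)
assert sum(bit << i for i, bit in enumerate(s)) + (cout << 4) == 24
```

Each iteration depends on the carry from the previous one, which is exactly the sequential dependence that carry-save addition removes.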
If we have to add three bit vectors, A, B, and C, each of size n, we can use this method to
add first A and B, and then to add C to the result of A+B. The number of bit additions is O(2n).
We see that if we were to use this technique in the most simple-minded fashion to add n
shifted copies of an n-bit multiplicand, the delay will be O(n2). This occurs because we
assume that the add operations are dependent on previous add operations (the output of earlier
operations are inputs to later operations). See Fig. 3.
Although the final result comes about from combining all add operations, a certain
amount of independence exists between each operation. We can consider the add operation
on a column by column basis; all the bits in a column must be added together, along with the
carry-in bits of the previous column. The delay in calculating the output of a given column
is a function of when the carry-in bits (from the previous column) are available, as well as
the number of bits which are to be added (Fig. 3.) Carry-save addition leverages the fact that
the add operations in separate columns can be performed independently. If we are to add
three vectors of bits, we can use full adders to add the three bits in each column. The result
is a carry and a sum bit at each bit position, except at the lowest and highest bit position,
Figure 3. Partial product addition. (a) Full adder cell. (b) Basic ripple adder. (c) We can use ripple addition to add all the shifted copies of the multiplicand. (d) Since there are n-1 ripple adders, each of width n, this basic method takes O(n2).
which have one bit each. We see that three bit vectors have been “reduced” to two bit vectors, in the delay necessary for a full adder. Using this technique called carry-save addition, we can ‘reduce’ a set of vectors to be added together down to two bit vectors. Since the full adders are the basic addition element, full adders used in this context are often called carry-save adders or CSA's. Using carry-save addition is the prime reason why multiplication is much faster than would be predicted by the number of additions necessary.
2.1.4 Array Style Reduction
Given the trapezoidal array of partial product bits which must be added using carry-save addition, there exist several ways to implement a partial product reduction adder. In this section, we describe the most basic method, called array partial product reduction.
For example, in Fig. 4a, we see the trapezoidal PPA generated for a 6-bit by 6-bit multiplication. We can take the first three bit vectors, and add them using full adders, as in Fig. 4b. Combining the results of the addition with the remaining bits of the PPA, we get a result which appears in Fig. 4c. As we can see, three vectors have been reduced to two vectors.
Note that the outputs of this first set of full adders can now serve as inputs to another row of full adders, along with another bit vector. This structure repeats until the full array is instantiated as in Fig. 4d.
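The row-by-row carry-save reduction just described can be sketched as follows; the integer-encoded bit vectors and helper names are ours, not the thesis's:

```python
# Sketch of array-style carry-save reduction on integer-encoded bit vectors.
# Each step applies a full adder to every column at once, turning three
# addends into a sum vector and a carry vector of equal total value.

def carry_save(a, b, c):
    """3:2 reduction: sum = a^b^c, carry = majority, shifted up one column."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def reduce_array(vectors):
    """Fold each vector into the running (sum, carry) pair, one row each."""
    s, cy = vectors[0], 0
    for v in vectors[1:]:
        s, cy = carry_save(s, cy, v)
    return s, cy

rows = [3 << k for k in range(6)]             # e.g. shifted multiplicand copies
s, cy = reduce_array(rows)
assert s + cy == sum(rows)                    # only one carry-propagate add remains
```

The value is preserved at every row; a single conventional addition of the final sum and carry vectors produces the result.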
The notable characteristic of the array architecture is its regular structure. This has the advantage that it is very easy to lay out, as a single adder block and associated connec-
Figure 4. Array partial product reduction. (a) The initial partial product array. (b) Using a row of carry-save full adders to reduce 3 bit vectors down to two. (c) Resulting PPA. (d) Full array structure.
tions are replicated the width and depth of the array. The delay of this block is a function of the number of rows, O(n), which is a big improvement over the simple-minded scheme of using conventional adders for each row. However, it is possible to do better.
2.1.5 Wallace Tree Partial Product Reduction
In 1964, C.S. Wallace [23] observed that the later stages of the array structure must always wait for all the earlier stages to complete before their final values will be established. When performing a series of independent add operations, it is possible to create a structure which has less delay by performing the addition operations in parallel, where possible. For example, in the partial product array for 6-bit x 6-bit multiplication, two carry-save reductions can be done in parallel, resulting in a smaller PPA after just one step (Fig. 5a-c). This can be repeated, yielding the structure shown in Fig. 5d. (The final set of connections is somewhat complicated to draw here.)
Obviously, parallelizing the carry-save operations will yield a delay which is much shorter than the array's sequential series of operations. When using carry-save addition, we take three input bit vectors and reduce these to two output bit vectors. A sequential carry-save procedure will reduce the number of bit vectors by one at each stage, whereas the parallel method can
take sets of 3 vectors and reduce them to sets of 2 vectors. The number of stages, and hence
the delay of the sequential operation will be O(n), whereas the parallel method will be
Figure 5. Wallace tree partial product reduction. (a) partial product array (b) parallel carry-save addition (c) resulting PPA (d) complete Wallace tree structure.
O(log3/2 n).
Such parallel arrangements of CSA blocks are called Wallace trees and allow for a large reduction in the delay of the partial product reduction stage. The disadvantage of Wallace trees lies in their irregular layout (especially with respect to array structures), resulting in potentially greater wire loads. Furthermore, note that array structures result in a final add operation of width n, whereas the final adder in Wallace trees has a width of approximately 2n - log3/2 n.
2.1.6 Partial Product Reduction/Generation Using Booth Recoding
The technique of Booth recoding is based on the observation that under certain conditions, namely when a bit in the multiplier is ‘0’, a bank of carry-save adders does not perform a useful function; that is, a ‘0’ is added to the accumulated carry-save result, and the input bits are simply propagated to the output bits. In this case, these carry-save adders could be removed from the multiplier structure, resulting in delay and power savings. Unfortunately, it is not generally possible to know a priori what bits of the multiplier will be ‘0’. To maintain generality, we must provide for the case when all bits of the multiplier are ‘1’.
It may be possible to reduce circuitry, however, if one considers the largest delay case. In a 4 bit x 4 bit multiplication, circuitry must be provided for the case where the multiplier is
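The parallel 3-to-2 grouping that gives Wallace trees their logarithmic depth can be sketched as follows; the integer encoding and helper names are ours, not the thesis's:

```python
# Sketch of Wallace-style reduction: at each stage, group the pending bit
# vectors into sets of three and reduce each set from 3 to 2 in parallel,
# so the vector count shrinks geometrically rather than by one per stage.

def csa(a, b, c):
    """Carry-save adder: column-wise full adder over whole bit vectors."""
    return a ^ b ^ c, ((a & b) | (a & c) | (b & c)) << 1

def wallace_reduce(vectors):
    """Reduce a list of addends to two vectors; returns (vectors, stages)."""
    stages = 0
    while len(vectors) > 2:
        nxt, i = [], 0
        while i + 3 <= len(vectors):          # 3:2 reductions in parallel
            s, cy = csa(vectors[i], vectors[i + 1], vectors[i + 2])
            nxt += [s, cy]
            i += 3
        nxt += vectors[i:]                    # 0-2 leftover vectors pass through
        vectors, stages = nxt, stages + 1
    return vectors, stages

addends = [1 << k for k in range(8)]          # 8 shifted copies, sum = 255
final, stages = wallace_reduce(addends)
assert len(final) == 2 and sum(final) == 255
assert stages == 4                            # vs. 6 sequential rows in an array
```

Eight addends collapse in four parallel stages, whereas the array scheme of the previous section would need six rows.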
‘1111’, resulting in a delay of 4 stages. An important observation is that multiplying by
‘1111’ is the same as multiplying by ‘10000’ and subtracting the multiplicand from the result; multiplying by a power of two is simply a shift, so this costs two stages. Therefore, we have cut down our worst case from 4 stages of delay to 2 stages of delay.
This type of stage reduction can be generalized into the technique known as Booth recoding. Three bits of the multiplier are used to determine whether a shifted and/or complemented copy of the multiplicand is to be used. Two bits of the multiplicand are multiplexed to create the actual value. The encoding scheme is shown in Table 1, from [33].
Table 1: Booth encoding

    x2i+1  x2i  x2i-1  |  di
      0     0     0    |   0
      0     0     1    |   1
      0     1     0    |   1
      0     1     1    |   2
      1     0     0    |  -2
      1     0     1    |  -1
      1     1     0    |  -1
      1     1     1    |   0
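The digit selection of Table 1 can be sketched as follows; the code is illustrative (unsigned, even-width multiplier), and the names are ours, not the thesis's:

```python
# Sketch of radix-4 Booth digit selection per Table 1: each overlapping
# triple (x_{2i+1}, x_{2i}, x_{2i-1}) of multiplier bits yields a digit
# d_i in {-2,-1,0,1,2} with sum(d_i * 4^i) equal to the multiplier.

BOOTH = {                                     # (x2i+1, x2i, x2i-1) -> di
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}

def booth_digits(x, n):
    """Recode an n-bit unsigned multiplier (n even) into radix-4 digits."""
    bits = [(x >> k) & 1 for k in range(n)] + [0, 0]   # zero-extend
    digits = []
    for i in range(0, n + 2, 2):
        prev = bits[i - 1] if i else 0        # implicit x_{-1} = 0
        digits.append(BOOTH[(bits[i + 1], bits[i], prev)])
    return digits

for x in range(16):                           # every 4-bit multiplier round-trips
    assert sum(d * 4 ** i for i, d in enumerate(booth_digits(x, 4))) == x
```

Note that the worst case ‘1111’ recodes to the digits (-1, 0, 1), i.e., 16 - 1, reproducing the shift-and-subtract observation above.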
The connections of the Booth multiplexors are shown in Fig. 6.
The net result is that the size of the PPA is reduced, since fewer shifted copies of the multiplicand are necessary in the partial product array. The number of these multiplicand ‘copies’ needed in the partial product array depends on the degree to which Booth recoding is applied. Generally, each level of recoding cuts the number of partial product bits in half. There is additional circuitry involved in performing the recoding however, so this optimization entails inserting complicated logic which itself adds delay and power consumption. In practice, Booth recoding is applied to one level (called radix-4) or two levels, but hardly ever more
Figure 6. Booth recoding (radix 4).
than two levels.
Booth recoding was initially applied to array multipliers, where a reduction in the size of the PPA by a factor of two means reducing the delay of the partial product reduction stage by a factor
of two. Note, however, that cutting the size of the PPA in half has less of an impact on the
Wallace tree scheme, since Wallace trees reduce the size of the PPA at a logarithmic rate; the
savings from Booth recoding yield a reduction of 1-2 levels of logic. Furthermore, there is
logic involved in the actual Booth recoding process. Therefore, it is unclear whether there is
an advantage in applying Booth recoding to Wallace trees. Doubts have been expressed
about the validity of Booth recoding for Wallace trees even for 64-bit x 64-bit multipli-
ers[28].
2.1.7 Final Adder
A common method for achieving low delay in multipliers is to speed up the final addi-
tion stage. Several optimizations exist for performing high-speed addition, as summarized in
[31]. The straightforward application of these designs to multipliers has resulted in various
designs with high speed or low power [37]. In our analyses, we considered carry ripple,
carry skip, and carry select adder structures for this final addition step.
Ripple Adder
As noted previously, the ripple adder is the slowest yet lowest power adder implementation. The ripple adder is of great interest because higher speed adders often incorporate the ripple structure into sub-blocks of the greater high speed adder structure. Therefore the speed and power properties of the ripple adder determine the overall performance of a wide range of adder designs. The structure of ripple addition is shown below in Fig. 7. As can be seen, the delay is linear in the number of addition stages.
Carry-Skip Adders
Carry-skip adders are based on the principle that only under certain conditions can a bit ripple all the way through an adder, from a low bit position to a high bit position. Assuming a carry-in of ‘1’, all pairs of bits to be added must contain at least one bit whose value is ‘1’; if this condition is not met, the rippling action stops. If the condition holds, any carry-in to a rippling block will be propagated to the carry-out, and the output will be ‘1’. In this case, the carry-in ‘skips’ the rippling block. For the ripple adder above, the skip condition is: (a0 + b0)(a1 + b1)(a2 + b2)(a3 + b3).
Figure 7. Ripple adder
A carry-skip adder is shown below in Fig. 8. The add operation is divided into ‘skip’ blocks, and the computation is performed for each block. A conventional ripple adder is included with the skip block to form the final adder. The delay of this block depends on the size of the internal skip blocks as well as the delay for the skip condition calculation; some research has been done on determining the optimal arrangement of carry-skip adders [33].
Ultimately, carry-skip adders are best used in technologies where rippling is very fast.
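The skip condition quoted above can be sketched on a 4-bit block; in the example every column can propagate but none generates a carry, so the carry-in passes straight through. The helper names are ours, not the thesis's:

```python
# Sketch of the carry-skip condition on a 4-bit block. Below, every column
# satisfies a_i OR b_i without any column generating (a_i AND b_i), so the
# carry-in ripples straight through to the carry-out.

def ripple_carry_out(a_bits, b_bits, cin):
    c = cin
    for a, b in zip(a_bits, b_bits):
        c = (a & b) | (a & c) | (b & c)       # full-adder carry chain
    return c

def skip_condition(a_bits, b_bits):
    """(a0 + b0)(a1 + b1)(a2 + b2)(a3 + b3): every column may propagate."""
    return all(a | b for a, b in zip(a_bits, b_bits))

a, b = [1, 0, 1, 0], [0, 1, 0, 1]             # each pair holds exactly one '1'
assert skip_condition(a, b)
for cin in (0, 1):                            # carry-in 'skips' the block
    assert ripple_carry_out(a, b, cin) == cin
```

When the condition holds, the block-level carry-out can be taken from the fast skip path rather than waiting for the internal ripple chain.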
Figure 8. Carry Skip Adders: (a) One carry-skip block (b) 12-bit adder.
Carry-Select Adder
The carry select adder optimization is based on parallel computation: for a sub-block of bits to be added, two ripple adders are instantiated, one which assumes a carry-in of ‘0’ at the lowest bit position, while the other assumes a carry-in of ‘1’. When the actual value of the carry-in to the block is known, the correct ripple adder is selected via a MUX (Fig. 9a). A plot of the delay by bit position is shown in Fig. 9b.
Figure 9. Carry Select Adders: (a) Basic structure of adder, (b) delay chart, and (c) construction for minimum delay.
The carry select adder speeds up addition by implementing blocks of ripple adders
which operate in parallel. The overall final result comes from a chain of MUX elements
which choose the correct sequence of input carries. Although carry select adders achieve delay reductions, they tend to be higher in power due to the greater amount of circuitry needed
to calculate the addition. A low delay implementation of the carry-select adder can be con-
structed as follows. Starting with a basic ripple adder of two bits at the lowest bit position
(bits 0 and 1), a carry select block is constructed at bit positions 2 and 3. MUX elements
have approximately the same delay as a stage of the carry chain. Therefore, the next carry-
select block will be of length 3, at bit positions 4, 5, 6. Carry select blocks therefore start at
bit positions 2, 4, 7, 11, and so on. This is shown conceptually in Fig. 9c. In this formulation,
the delay of the adder will be proportional to the number of carry-select blocks.
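The minimum-delay block sizing just described can be sketched as follows; the function name is ours, not the thesis's:

```python
# Sketch of the minimum-delay carry-select sizing described above: after a
# 2-bit ripple block at positions 0-1, each successive select block is one
# bit longer than the last, since each MUX in the select chain buys time
# for one more ripple stage.

def select_block_starts(width):
    """Starting bit positions of the carry-select blocks for an adder."""
    starts, pos, length = [], 2, 2            # first select block: bits 2-3
    while pos < width:
        starts.append(pos)
        pos += length
        length += 1                           # next block is one bit longer
    return starts

assert select_block_starts(16) == [2, 4, 7, 11]
```

The number of blocks, and hence the delay, grows only as roughly the square root of the adder width.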
A final type of adder is the Carry-Lookahead Adder, which is one of the fastest tech-
niques to perform addition. A brief discussion can be found in [31].
Adder Comparisons
Adder implementation characteristics are summarized in Table 2, taken from [31]. There
is an area-delay trade-off between adders which is partly shown in this chart. Although the
asymptotic delays of carry skip adder and carry select adders are similar, the carry select
tends to be faster but larger than the carry skip adder, while the carry skip adder tends to be
slower but smaller than the carry select adder. This fits the size and delay trends when com-
pared to the ripple adder.
2.1.8 Summary
Since multiplication is a very complex operation, a number of optimizations have been devised to reduce its delay. The multiplier can be sped up by reducing the delay in several
of its component blocks. Recent considerations of power dissipation may bias which ways the
multiplier delay is reduced.
2.2 CMOS Power
The implementation of a multiplier in CMOS digital logic involves various trade-offs
which are particular to the technology. For example, a bipolar implementation would dictate
minimizing circuitry due to the static power dissipation component inherent in bipolar gates.
CMOS, on the other hand, does not suffer from as significant a static power dissipation component, and thus
is more amenable to adding devices.
Table 2: Asymptotic time and space characteristics

                  Time     Space
    Ripple        O(n)     O(n)
    Carry skip    O(√n)    O(n)
    Carry select  O(√n)    O(n)
2.2.1 Static vs. Dynamic Power
There are two modes of power dissipation in integrated circuits: power generated during
static operation versus that generated during dynamic operation. Static power dissipation is
a function of all currents which flow when no switching is occurring. These include currents
due to pn-junctions, static current due to biasing of devices, and leakage currents. Dynamic
power is a result of switching activity, when currents cause capacitances to charge or dis-
charge while performing logic operations. Dynamic current can be large or small depending
on desired delays and capacitances present.
Static CMOS has become the logic family of choice for digital circuit implementations
for several reasons, one being desirable power dissipation characteristics. Due to the com-
plementary nature of the PMOS and NMOS devices, CMOS has low static power dissipa-
tion as one device is generally off when the other is on. Furthermore, there is no D.C. input
current, meaning that very little current flows when the device is not switching. Dynamic
power dissipation in CMOS can be described by the equation in Fig. 10.
Static power dissipation in CMOS is due to leakage currents. Reverse biased pn junctions
form a part of this current, although their contribution is relatively small. More problematic is
subthreshold leakage, where current flows across a transistor when it is nominally "off". For
example, if an NMOS device has 3.3V on its drain, and 0V on its gate and source, ideally no
current flows. In reality a very small current is present, which is an exponential function of the
gate-to-source voltage. While this current is very small (~10 picoAmperes for an NMOS
device in a .35µm process), process shrinks have shown that this current has been increasing
in successive generations (for the same family of processes as above, subthreshold leakage is
~1nAmp at .25µm). Considering the large number of devices on a die, this effect may contrib-
ute substantially to power dissipation for future processes.
Dynamic power dissipation is a function of the behavior of CMOS during logic operations. Two primary currents are present, one which goes to charge capacitances of devices at the output of a gate, and a second 'parasitic' short circuit current. During part of the switching time, when the input voltage is between the supply rails, both PMOS and NMOS devices can conduct. This results in a current flowing directly from the supply voltage rail to ground, called "short circuit current", "crowbar current" or "totem-pole current". The actual value of the current is a function of the conductances of the PMOS and NMOS transistors, as well as the slope of the input signal. In a well-designed circuit, this should contribute < 10% of overall power [6].

P = αCVdd² + K·I·Vdd

α — activity factor: the number of transitions (per operation, e.g., in one cycle)
C — switched capacitance
Vdd — supply voltage
K·I — static current component; K ~ W/L of the PMOS and NMOS devices and the input signal slope

Figure 10. Equation for CMOS power consumption

Figure 11. Short circuit current occurs when CMOS devices switch. If the input of the gate is in the region where the PMOS and NMOS devices are both on, current Ishort will flow directly between the Vdd and Gnd rails. Short circuit current can occur if the gate input is not held close to the Vdd or Gnd rails.

The main source of power dissipation in CMOS, that used in the charging and discharging of capacitances, accounts for 70-90% of overall power dissipation and is therefore the main target of power optimization.
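As an illustrative sketch (not part of the dissertation), the Fig. 10 relation can be evaluated numerically. The parameter values below are invented for illustration, and the dynamic term is multiplied by an assumed clock frequency so that both terms come out in watts:

```python
def cmos_power(alpha, c_switched, vdd, f_clk, i_static):
    """Evaluate the Fig. 10 power relation (toy numbers, not process data).

    alpha      - activity factor (average transitions per cycle)
    c_switched - switched capacitance in farads
    vdd        - supply voltage in volts
    f_clk      - assumed clock frequency in hertz, used to turn the
                 per-cycle energy alpha*C*Vdd^2 into a power
    i_static   - static (leakage/bias) current in amperes, the K*I term
    """
    p_dynamic = alpha * c_switched * vdd ** 2 * f_clk  # alpha*C*Vdd^2 term
    p_static = i_static * vdd                          # K*I*Vdd term
    return p_dynamic + p_static

# Illustrative numbers only: 50 pF switched at alpha = 0.2, 3.3 V, 100 MHz,
# with 1 uA of static current.
p = cmos_power(alpha=0.2, c_switched=50e-12, vdd=3.3, f_clk=100e6, i_static=1e-6)
```

With these invented numbers the dynamic term (about 10.9 mW) dwarfs the static term (about 3.3 µW), consistent with the text's observation that charging and discharging of capacitances dominates overall power.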
2.2.2 Power Optimization Fundamentals
To understand the prospects for power reduction in multipliers, we will describe
approaches which can be taken to reduce power dissipation in CMOS logic structures. Quite a
number of techniques are applicable to multipliers, although their practicability limits their
use in industry.
Three factors are relevant in dynamic power calculations: supply voltage, capacitive load
and activity factor. The first is the voltage to which capacitors are charged, typically the sup-
ply voltage Vdd. Methods which focus on the second, capacitance reduction, are popular as
lowering capacitance benefits both power and delay. The third target is activity reduction,
which stresses minimizing the number of times a node switches in a given period of time.
Careful design of delays in circuits is the primary method for reducing activity.
Voltage
Downward voltage scaling typically accompanies process shrinks and has a great effect in
reducing power dissipation, since power is a function of the square of the voltage
(P ~ αCVdd2). Techniques which further lower voltages have been successful in minimizing
power dissipation, but voltage reduction can result in a corresponding increase in delay. Delay is inversely proportional to the difference between supply and threshold voltage (D ~ 1/[Vdd - Vt]), so a decrease in the supply voltage will give a lower power, but slower, circuit—this can be compensated by a reduction in the threshold voltage, Vt. However, lowering the threshold voltage causes an increase in power (both static and dynamic), increases current leakage, and compromises noise margins. Where to set Vdd and Vt in a given process continues to be a subject of investigation.

Several methods have been proposed for varying Vt during operation through the use of the "backgate effect", also known as the "body effect". By putting a negative voltage on the substrate (Vsb > 0), the effective threshold can be raised, thereby increasing delay and lowering power consumption (see Fig. 12).
Figure 12. Biasing for Vt modification via the backgate effect (bias voltage Vsb applied between source and substrate)
Designs are based on a low Vt, low delay and high power implementation; when one
wishes the circuit to go into "sleep mode", the backgate voltage is lowered and the circuit
becomes low power.
Unfortunately, the above technique has two problems. The amount by which Vt is raised is a function of the square root of Vsb, that is: Vt ≈ Vt0 + γ√Vsb. Therefore, if one wishes to strongly turn off the device, there is a trend of diminishing returns. A second problem relates to process scaling: as transistor dimensions have gone down, the γ parameter in the above equation also scales down. In successive process generations, we see a reduction in the ability to control the threshold voltage via the body effect.
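The diminishing-returns behavior follows directly from the square-root relation above. A toy calculation with made-up Vt0 and γ values (not process data) makes it visible:

```python
import math

def vt_body(vt0, gamma, vsb):
    # Threshold shift grows only with the square root of the body bias Vsb,
    # per Vt ~ Vt0 + gamma * sqrt(Vsb).
    return vt0 + gamma * math.sqrt(vsb)

# Made-up parameters: Vt0 = 0.5 V, gamma = 0.4 V^(1/2).
# Incremental Vt gain for each additional volt of Vsb:
shifts = [vt_body(0.5, 0.4, v) - vt_body(0.5, 0.4, v - 1) for v in (1, 2, 3, 4)]
# Each extra volt of body bias buys a smaller Vt increase than the one before.
```

The first volt of Vsb raises Vt by γ, the second by only γ(√2 − 1), and so on — the diminishing returns the text describes.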
Similar methods for varying the voltage of the supply rails have been proposed and implemented. The simplest of these involves a servo control on the value of the voltage rails, adjusting the value appropriately based on whether the circuit has to have low delay or low power. Other techniques, such as having portions of the circuit use different sets of power rails, have shown success [18].
Capacitance
As mentioned previously, capacitance reduction has been a primary goal of circuit design-
ers since lowering capacitance also contributes to delay reduction. For power minimization, it
is important to consider capacitance and activity factor together. For example, a high capacitance section of circuit which does not switch very often may contribute less power than a low capacitance point that experiences high switching activity.

Capacitance in integrated circuits can be classified into three groups—these are gate capacitance (inputs of devices), parasitic capacitances (internal nodes of gates) and interconnect capacitance. The basic approach for capacitance minimization in cell-based design is to create a library of various sized cells, minimizing parasitic capacitances during layout through careful physical design. Gate capacitance is a fixed function of the desired driving strengths of each cell. During system level place and route, interconnect capacitance is minimized by placing gates close together and avoiding high amounts of coupling between signal lines.

When using a standard cell gate library, it is important to note that not all inputs to a gate which have the same function (e.g., NAND inputs, NOR inputs) exhibit the same delay and power characteristics. The variance in delay as a function of the inputs is well known, but
there is also an effect on power. By having certain signals arrive later than others, it is possible
to minimize power dissipation caused by inadvertent charging and discharging of parasitic
capacitances.
To demonstrate this parasitic charging effect, consider the 3-input NAND gate, shown in
Fig. 13. If one models the pulldown devices using a resistor to represent the on-resistance of
the NMOS devices, with parasitic capacitance at the drain and source of each device, we see
three points of capacitance, one at the output and two within the pulldown chain.
The charging behavior of each point depends on the arrival times of the input signals.
Consider two cases where the inputs switch from 000 to 111. In both cases, the output goes
from high to low, but in one case the parasitic capacitances are charged up prior to being dis-
Figure 13. Pulldown stack of 3-input NAND (inputs X, Y, Z in series between the output and Gnd)
charged, whereas very little charging of parasitics occurs in the second case (see Fig. 14). Such input ordering can result in power savings of 10-20%, depending on the values of the capacitances [13].
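The input-order effect can be illustrated with a toy model of the pulldown stack (a sketch with invented structure, not a circuit simulation). Internal node n1 sits between X and Y, n2 between Y and Z, and a node "charges" when on-devices connect it to the high output while the path to ground is blocked:

```python
def internal_charge_events(arrival_order):
    """Toy model of the 3-input NAND pulldown stack: X (top), Y, Z (bottom).

    Inputs rise one at a time in arrival_order while the output is high.
    Returns the number of internal-node charging events, each of which
    costs energy that is later thrown away on the final discharge.
    """
    on = {"X": False, "Y": False, "Z": False}
    node_high = {"n1": False, "n2": False}
    events = 0
    for sig in arrival_order:
        on[sig] = True
        # n1 charges if X conducts (path up to the output) and Y blocks.
        if on["X"] and not on["Y"] and not node_high["n1"]:
            node_high["n1"] = True
            events += 1
        # n2 charges if X and Y conduct and Z blocks.
        if on["X"] and on["Y"] and not on["Z"] and not node_high["n2"]:
            node_high["n2"] = True
            events += 1
    return events
```

With arrival order X, Y, Z, both internal nodes are charged before the final discharge; with order Z, Y, X, neither node ever charges — the two cases of Fig. 14.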
Activity
A large amount of power is dissipated when a node switches several times prior to settling to its final value. These extra transitions are due to unequal arrival times at the inputs of devices, which cause several state changes before the output settles.
Figure 14. Switching behavior for different input arrival times
In Fig. 15, the bottom input of the AND gate is delayed by the inverter. Although the final
value of the output is the same as the initial value, the gate output may undergo a spurious
transition, due to the delay introduced by the inverter. Such a transition is called a "glitch" or
we say the output node experiences "false switching."
Glitch minimization focuses on attempting to equalize signal arrival times at the inputs of
a gate. There are several techniques which can be applied to achieve this result. For example,
buffers can be inserted which introduce delays in fast signal paths. This causes signals travel-
ling along these paths to slow down, allowing inputs to transition at similar time points.
Another technique is to introduce a row of latches, which are then triggered at the same time,
thereby "filtering" out fast signals. Otherwise, the logic may be resynthesized, attempting to
generate paths whose delays are more "balanced" in terms of arrival times at the inputs of
gates. Note that some of these techniques modify the logic by adding elements (buffers,
latches), and one must be careful that the extra power of these devices is compensated by a
Figure 15. Output glitching.
48 Analysis and Design of Low Power Multipliers
Multiplier Power Reduction
larger amount of overall power savings.
Design for path balancing is a difficult goal due to conflicting effects when gates are
sized. Given a logic structure, one wishes to size the gates to minimize power, subject to a
delay constraint. Power can be reduced by reducing the size of gates, thereby reducing
switched gate capacitance. However, path balancing attempts to increase delay on fast paths
by reducing gate capacitance, causing risetimes and falltimes to be much longer, corre-
sponding to a slow signal slope. The problem lies in increased short circuit current as the
signal slope is reduced; both NMOS and PMOS are turned on for a longer time. Trading off
these effects is a very difficult optimization problem. Some solutions have been proposed,
for example starting from a design with gates of minimum size, and upsizing the gates based
on a static timing analysis until delay constraints are met[49]. This problem continues to be
a focus of power optimization at the transistor level.
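One proposed flow [49] starts from minimum-sized gates and upsizes until timing is met. A minimal sketch of such a greedy loop, using an invented load/size delay model in place of a real static timing analysis:

```python
def size_for_delay(loads, target_delay, step=1.25):
    """Greedy sizing sketch: all gates start at minimum size (1.0) and the
    slowest stage is repeatedly upsized until the path meets target_delay.

    Stage delay is modeled crudely as load/size; total size is a proxy for
    switched capacitance and hence power.
    """
    sizes = [1.0] * len(loads)

    def delays():
        return [l / s for l, s in zip(loads, sizes)]

    while sum(delays()) > target_delay:
        # Upsize the critical stage; this costs gate capacitance (power).
        worst = max(range(len(loads)), key=lambda i: delays()[i])
        sizes[worst] *= step
    return sizes

# Three-stage path with (invented) unit-size delays of 4, 2 and 1;
# size it to meet a path delay of 3.
sizes = size_for_delay([4.0, 2.0, 1.0], target_delay=3.0)
```

The loop captures the tension described above: meeting delay forces sizes (and therefore capacitance) up, while power pulls them toward minimum.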
2.3 Multiplier Power Reduction
The design of digital CMOS has historically focused on delay reduction; only recently has power dissipation gained prominence. In multipliers, delay increases as the
size of the multiplier grows (in terms of bits, e.g., a 16-bit by 16-bit multiplier) but can vary
depending on implementation. Power is proportional to the amount of circuitry present in
the multiplier and how that circuitry is connected to perform the multiplication. The number of adder blocks comprising a multiplier grows as the square, O(n²), of the size of the multiplication. Therefore, multipliers tend to be fairly large, power-consuming blocks.
To a first order, both power and delay can be minimized by using the smallest multiplier
design available. Therefore, while microprocessors use 53- and 64-bit multipliers in accordance with the IEEE standard, DSP multipliers have sizes in the range of 8, 16, or 24 bits. Often, the multiplier will perform a multiplication on two numbers of a certain resolution (for example, multiply two 16-bit numbers), but will incorporate the resulting value in a final add of greater resolution (e.g., 24 bits). This is a consequence of the carry-save methodology prevalent in
most methods for performing partial product reduction, as well as the need in many DSP algo-
rithms to perform numerous sequential multiply operations which are accumulated together.
Typically, power and delay minimization techniques focus on the various sub-blocks com-
prising a larger block and address power optimization in each of these independently. How-
ever, dependencies between these sub-blocks affect the overall power characteristics, and
some benefit can be gained from an integrated approach. We will initially consider basic
blocks in a multiplier separately, then describe how interdependencies affect overall opera-
tion. (An example of a power/delay reduction technique which lends itself very well to the
interaction of various sub-blocks is presented in the final chapter.)
We can divide power analysis and optimization of multipliers along the lines of the
design hierarchy. We initially focus on circuitry used to implement the logic functions, the
design of the logic functions comprising the multiplier, and the architecture of the multiplier
as a whole.
2.3.1 Multiplier Circuitry
The basic circuitry used to implement the multiplier is defined according to process
technology. Currently, the vast majority of digital logic is implemented in standard CMOS.
Aggressively low power designs attempt to use adiabatic techniques[13] for low power
operation, although these compromise delay to achieve the power gain. Low swing logic is a
technique which can be usefully applied to general CMOS circuits[17][18], and has been
applied to multipliers in particular. Although a wide range of circuit techniques exist for
implementing fast and/or lower power arithmetic components [37], standard CMOS contin-
ues to be the circuit of choice[39]. Very aggressive high speed designs use dynamic
CMOS[34][36], which unfortunately is also power-intensive. Our focus will be on circuits
designed in standard static CMOS.
2.3.2 Logic Level Multiplier Optimization
Since the components comprising a multiplier (PP generation, PP reduction, final adder)
have been fairly standard, and as this decomposition is recognized as one of the best, if not the best, ways to implement high speed multipliers, a good deal of effort has been spent on optimizing the power of these components. We will describe some of these investigations.

Multipliers can be constructed which take advantage of special characteristics of the numbers which they are multiplying. For example, it is observed in [41] that in FIR filters, the coefficients used in multiplications do not change values. It is empirically observed that Booth recoded multipliers can be implemented to be low power if the coefficients are used for the multiplier (which is then encoded), as opposed to the data inputs. In this case, a lower number of transitions results, due to the different characteristics of the input. Similarly, transition reduction can be achieved in 4-2 reduction trees by noting that two outputs have the same 'weight' [40]. Therefore, a degree of freedom exists in circuit implementation, which is important since one of the outputs typically has a greater output load. It is possible to assign a less frequently transitioning signal to the more heavily loaded output to achieve power savings. Different circuit arrangements are proposed which attempt to minimize output transition probabilities and their propagation through the circuit.

Booth recoding suffers from the problem that unequal delay paths exist in the Booth partial product generator. One path goes from the multiplier, through the Booth encoder, and then to the Booth decoder, while paths from the multiplicand go directly to the Booth decoder. Since the Booth decoder is typically two gates deep, a glitch can result at the output of the
Booth decoder due to this greater delay path (see Fig. 5). One approach is to redesign the Booth encoders/decoders[42] in the following manner: the Booth encoder's logic depth is reduced and the Booth decoder is designed such that early arriving signals (from the multiplicand) have a greater delay "through" the decoder than inputs from the Booth encoder. This balances the signal paths and allows reduced glitching. Unfortunately, the above study does not present data confirming these theoretical results.

The presence of spurious signal transitions is the object of much study. For example, the characteristics of array partial product reduction schemes which lead to high switching activity are addressed in [43], which attempts to create a more regular generation of partial product bits when they are needed. This work is studied more closely in Chapter 4.

Latching and gating of signals has been studied in various contexts, with successful application to multipliers. Certain inputs have special characteristics which can be leveraged to avoid unnecessary calculations. For example, a simple case is that if one of the inputs to a multiplier is '0', the output is obviously '0'. Such pre-computations can be used to stop the multiplier from calculating, through the use of latches on the inputs—if the output is useless, the input latches are not made 'transparent' for these inputs, and the next set of inputs is considered[21]. Furthermore, the case where a bit vector of all '0's is added to the partial result during carry-save addition implies that useless additions are being performed. Work
on bypassing carry-save adders is presented in [44], which shows how power savings can be achieved through '0'-aware circuitry. Latching at a circuit level has also been explored[45], and we further describe these developments in Chapter 4.

2.3.3 Architectural Power Optimization
A good deal of work has been done in analyzing architectural choices in multiplier design, at several levels. At the basic level, the architecture of the multiplier can be modified in several ways, which is one of the foci of this research. The choice of arrays versus Wallace trees has primarily been based on their respective delay properties, while the relative power dissipation merits of each implementation are unclear[37][38]. We further describe these issues in Chapter 3.

The multiplier is typically a component of a larger system, which itself is assembled to implement a particular chip architecture. The interaction of the multiplier with its surroundings impacts the power of the entire system, i.e., the system impacts the power of the multiplier by supplying the inputs to the multiplier, and the multiplier provides outputs which are read by the remainder of the system.

An example of work that considers these issues is [19], which investigates the frequency of transitions in high versus low order bits of a multiplier. Such information can lead to using smaller multipliers when necessary and only doing full-precision multiplication when the
need is detected[20]. Another approach at the architectural level is to consider to what
extent operations can be parallelized. If a number of operations are independent of each
other and can be performed in parallel, multiple hardware can be implemented on-chip, to
be run at reduced voltage. Speed is nearly linearly proportional to voltage, but power is pro-
portional to the square of voltage. Therefore, if the increased number of functional units can
account for the speed reduction, a savings in power can be achieved[6]. Architecture-specific optimizations can achieve significant gains, at the cost of compromising generality. This thesis focuses on techniques which have wide applicability.
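The parallelism argument can be made concrete with the first-order models used earlier in this chapter (D ~ 1/[Vdd − Vt], P ~ αCVdd²). The voltage numbers below are illustrative, and the overhead of the duplicated hardware is ignored:

```python
def scaled_vdd_for_half_speed(vdd, vt):
    # D ~ 1/(Vdd - Vt): allowing twice the delay means halving (Vdd - Vt).
    return vt + (vdd - vt) / 2.0

def parallel_power_ratio(vdd, vt):
    """Power of two half-speed units vs. one full-speed unit.

    Total switched capacitance doubles, but each copy runs at half the
    rate, so the alpha*C*f product is unchanged; only the Vdd^2 factor
    drops with the lowered supply.
    """
    vdd_low = scaled_vdd_for_half_speed(vdd, vt)
    return (vdd_low / vdd) ** 2

# Illustrative: 3.3 V supply, 0.5 V threshold.
ratio = parallel_power_ratio(vdd=3.3, vt=0.5)
```

With these numbers the supply drops to 1.9 V and total power falls to roughly a third of the original, at the same throughput — the trade described in [6], before accounting for duplication overhead.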
2.3.4 Low Power Multiplication Research
The above work shows some of the areas in which low power multiplication techniques
have been developed. A basic characteristic about multiplication circuitry is that the primary
goal is to perform the operation with high speed. Multiplication variants have been devel-
oped with the express purpose of reducing the delay required to get a result. Therefore, our
main focus was to analyze these low-delay blocks.
Initially, the choice of Wallace tree vs. array partial product reduction was based on
delay characteristics, as explained above. However, it is also possible to achieve delay
reduction through high speed adders. If an overall multiplier target can be met through an
aggressive adder design, perhaps it is not necessary to use a Wallace tree, since the design
effort is greater for this type of circuit. Furthermore, it has been suggested that Wallace trees
should be avoided for low power applications [38]. This view has been contradicted by some
early experiments which we performed, indicating that Wallace trees have much lower power
dissipation while retaining their delay advantages. Such a point of view is confirmed by sim-
ple experiments in the literature[37]. Therefore, it might be more practical to use a low speed,
low power adder, in conjunction with a Wallace tree, to meet delay targets. In these studies, an
important consideration which was largely ignored was the impact of physical design on
power and delay. Our initial work attempted to resolve these ambiguities by instantiating a
series of Wallace tree and array designs with different adders, using a simple design flow
which accounted for layout characteristics (placement and interconnect capacitance). This
work is described in Chapter 3.
The presence of spurious transitions in multipliers suggests the use of latches to reduce
switching activity. Some work in this area has been done[43][45], although it focuses on array
multipliers, which are less interesting due to their high-delay properties. We attempt to deter-
mine whether it is possible to apply latching to Wallace trees, whose lower logic depth leads
to lower delay. Several challenges exist in this area, such as very wide signal paths and fewer spurious transitions to 'reduce', i.e., the design is initially quite low in power. The results of
our investigation are shown in Chapter 4.
Circuit design in multipliers has often been targeted at optimizing delay. The circuit design techniques described above attempt to reduce power while maintaining low delay—they are successful to varying degrees. We investigated a delay reduction technique in adders, which attempts to reduce the delay by lowering the logic depth. By lowering the logic depth, we reduced the amount of circuitry present in the adders. If one can reduce circuit count, it is possible that power will be reduced, since the total capacitance of the circuit implementation has been lowered. To assess the validity of this optimization, we instantiated adders using a library of different reduced-circuit logic blocks. A full analysis including layout and detailed SPICE simulations shows that this technique is viable for lower power multiplication. This work is described in Chapter 5.

2.4 Summary
In this chapter, we present an overview of multiplier operation, as well as the basic power dissipating characteristics in CMOS. While false transitions and parasitic currents seem unavoidable in CMOS implementations of multipliers, our goal is to minimize these effects while performing the operation. Design techniques have expressly focused on power reduction in recent years, although the larger goal is to achieve power efficiency without compromising delay, which is much more difficult. In subsequent chapters, we describe
analysis and design of multipliers for low power, using multiplier designs to reinforce our
conclusions. Particular attention is paid to physical design characteristics, which can affect
choices made at the logic level.
3 Power Trade-offs in Array and Wallace Tree Multipliers
3.1 Introduction
This chapter describes investigations into understanding basic power dissipation charac-
teristics of partial product reduction schemes. We attempted to understand the switching
characteristics of arrays and Wallace trees and how this switching leads to different levels of
power dissipation. Determining the answer required modelling not only the logical behavior
of each style but also creating designs with a rough representation of the physical design
characteristics of CMOS implementation, interconnect capacitance in particular. We arrived
at results which refuted earlier crude analyses of these structures. Results were published in
[46].
3.1.1 Partial Product Reduction Schemes
As discussed in Chapter 2, two competing schemes are commonly used to perform the partial product reduction step: arrays and Wallace trees. In the array structure, rows of partial products are added incrementally, resulting in n total adds of n bits wide each. The final fast add is also of size n. In the Wallace tree scheme, rows are added in parallel, so that at each 'level' in the process, m rows are reduced to (2/3)m rows. The total number of logic levels for partial product reduction is log_{3/2} n. However, the final add is approximately of size 2n − log_{3/2} n.

The array method is very popular as it lends itself to a clean VLSI implementation. The structure can be laid out in a regular array, where components communicate with other blocks which are placed at adjacent locations, resulting in very few long wires. However, array designs tend to be slower than those using the Wallace tree method. This is due to the deeper logic depth of arrays, which sets a lower limit on the delay of the circuit, even in the presence of gate upsizing. On the other hand, the Wallace tree logic network is highly irregular, necessitating a custom placement and routing phase. Furthermore, physical design of the Wallace tree can create long wires whose additional capacitance causes greater wire load. Despite these drawbacks, the Wallace tree structure is the partial product reduction method of choice in high-performance designs [28]. While it is generally accepted that the Wallace tree is faster, it was not entirely clear which of these two designs is lower in power.
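The level count can be checked with a short sketch. It applies the exact full-adder grouping (each group of three rows compresses to two, so m rows become roughly (2/3)m) rather than the continuous approximation:

```python
def wallace_levels(rows):
    """Count reduction levels needed to bring `rows` partial-product rows
    down to the final two, using 3:2 compression at each level."""
    levels = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3  # ~ (2/3) * rows per level
        levels += 1
    return levels
```

For example, 16 partial-product rows reduce in 6 levels (16 → 11 → 8 → 6 → 4 → 3 → 2), versus a linear number of adder rows in the array arrangement — the log-versus-linear gap discussed above.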
As we began this study, we found some disagreement among the few sources who speculated on the power characteristics of arrays versus Wallace trees. Bellaouar and Elmasry [38] suggest that Wallace tree styles are best avoided for low power applications, since the excess wiring was likely to consume extra power. On the other hand, Callaway and Swartzlander [37] demonstrated quantitatively that switching activity within just the partial product reduction hardware was substantially better for the tree over the array—if one ignores the wires completely. Montoye [30] alluded to the difficulties in dealing with Wallace tree wiring for one high-performance commercial processor (IBM RS/6000), but concluded that the speed gain is worth the trouble. Due to the complexity of the wiring problem, this design used counters larger than the basic CSA (3-input, 2-output) to reduce the amount of interconnect. Unfortunately, power was not a concern in this design.

A low power implementation of Wallace trees might be constructed using minimum sized devices (to reduce gate capacitance), although speed would dictate larger devices, due to the extra capacitance introduced by long wires (when contrasted to array multipliers). In this case, not only is power an issue, but we have to be aware of interconnect effects on the overall delay.

3.1.2 Analysis of Switching Behavior
The main difference between array and Wallace tree structures is the method in which
data is processed. Array structures incorporate carry-save additions in a sequential manner,
whereas Wallace trees perform carry-save addition in a parallel manner. In both cases, the
number of adders is roughly equal, as the bit compression procedure using full adders (3 input
bits, 2 output bits) does not vary whether one performs the procedure sequentially or in paral-
lel.
Although the linear versus logarithmic delay properties of these two styles are clearly
understood (they are a direct function of the logic depth), the reason for different power dissi-
pations is a bit more subtle. Signal flow is described based on the example shown below (Fig.
16).
For this example, we assume simple adder operations (i.e., ignore the carry, etc., assume
simple one-bit operation,) and we assume unit delay through all adders. In each design, one
clearly sees the effect of logic depth on delay. Switching behavior can be deduced from exam-
Figure 16. Sequential vs. parallel layouts for logic blocks. (Sequential: adder1, adder2, adder3 chained; parallel: addera and adderb feed adderc; inputs a, b, c, d in both cases.)
ining when the inputs cause switching to happen in downstream logic.

In the case of the sequential logic, where all inputs arrive at the same time, we see that at time = 0, inputs 'a' and 'b' are added, and this causes the output of adder1 to toggle in the next time frame. At the same time, 'c' is added to the initial output value of adder2 and its output toggles in the next time frame. Similarly with the output of adder3. At time = 1, 'c' is added to the new output of adder1, and the output of adder2 toggles again. Similarly with the output of adder3. Finally at time = 2, the value at the output of adder2 is added to 'd' and the final result is determined.

In the parallel case, at time = 0, the outputs of addera and adderb are calculated. At time = 1, the output of adderc is determined and no more switching occurs.

From a switching perspective, we can see that for sequential arrangements, adders at the nth level of logic switch n times, whereas for parallel operation, every adder switches only once. Therefore, parallel organization of adders has a beneficial effect of reducing switching activity.

Partial product reduction styles are only slightly more complicated since the adder blocks generate both carry and sum bits. These two signals have different delay properties, so that both array and Wallace tree reduction networks experience glitching due to these
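A small unit-delay simulation makes the switching difference measurable. This is a sketch using the same simplified one-bit adders as Fig. 16, modeled here as XORs that ignore carries; the netlists and delay model are invented to mirror the figure:

```python
import random

def count_toggles(netlist, old_in, new_in, steps=8):
    """Unit-delay simulation: each gate output at t+1 is the XOR of its
    input values at t. Returns total gate-output toggles after the inputs
    switch from old_in to new_in."""
    vals = dict(old_in)
    for _ in range(steps):  # settle the network with the old inputs
        vals.update({g: vals.get(a, 0) ^ vals.get(b, 0)
                     for g, (a, b) in netlist.items()})
    vals.update(new_in)     # all inputs change at t = 0
    toggles = 0
    for _ in range(steps):
        nxt = {g: vals[a] ^ vals[b] for g, (a, b) in netlist.items()}
        toggles += sum(nxt[g] != vals[g] for g in netlist)
        vals.update(nxt)
    return toggles

# Netlists from Fig. 16: a sequential chain vs. a parallel tree.
seq = {"adder1": ("a", "b"), "adder2": ("adder1", "c"), "adder3": ("adder2", "d")}
par = {"addera": ("a", "b"), "adderb": ("c", "d"), "adderc": ("addera", "adderb")}

random.seed(0)
seq_total = par_total = 0
for _ in range(1000):
    old = {k: random.randint(0, 1) for k in "abcd"}
    new = {k: random.randint(0, 1) for k in "abcd"}
    seq_total += count_toggles(seq, old, new)
    par_total += count_toggles(par, old, new)
# The chain accumulates extra transitions at deeper levels; the tree does not.
```

Over random vectors the chain's deeper adders toggle repeatedly before settling, so its total transition count comes out well above the tree's, matching the argument above.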
unequal delays. Furthermore, the delay out of any given input is a function of the input vectors; certain input patterns will cause the output to switch faster than other inputs. Finally, the Wallace PP reduction tree is not always as symmetrical as Fig. 16 implies. In some cases, signals 'skip' a level, as shown in Fig. 17.

In conclusion, glitching originates from several sources and even Wallace tree styles experience these spurious transitions. Nevertheless, array partial product reduction structures tend to experience a much greater amount of false switching than Wallace trees. For arrays, glitching due to unequal delay of input signals based on logic depth is the primary consideration.
Figure 17. Signal flow in arrays vs. Wallace trees. (a) In arrays, inputs are present at every level of logic depth, so digital circuits at deeper logic levels experience more switching. However, the carry-save adders are arranged in rows, so signals tend to flow in "waves" down the logic. (b) Wallace trees have inputs at one logic level, so input data arrives in parallel and flows downward. However, some connections "skip" a logic level and so input arrival times tend to be skewed at deeper logic levels.
3.1.3 Initial Investigations
Intuition suggests that the log-versus-linear depth of the reduction network for the Wallace tree might well lead to shorter propagation paths and less power-consuming glitching. We performed a first order analysis of the average transition counts across sets of random vectors applied to both array and Wallace tree designs. This was a Verilog model which used unit delay estimates to model the carry and sum delay differences. We assigned a delay value of '2' to the carry, and a value of '3' for the sum, mimicking the rough delay difference in a typical CMOS implementation. Input capacitances were not incorporated into the analysis. The results are shown in Fig. 18.

Interestingly, a very similar figure appears in [37]. That work presented a detailed analysis of power for various adder forms, and then speculated on power dissipation in multipli-
0
20000
40000
60000
80000
4 6 8 10 12 14 16
Avg
. Num
. Tra
nsiti
ons
(100
tria
ls)
Multiplier Size (bits)
Array Multiplier
Wallace TreeMultiplier
Figure 18. Transition count comparison for multipliers.
Analysis and Design of Low Power Multipliers 65
ers by performing an analysis similar to the one described above. Reflecting our initial
hypothesis, the authors observed that Wallace trees should have a significant power advan-
tage.
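A first-order experiment of this kind can be sketched with a small event-driven simulator. This is an illustrative reconstruction, not the original Verilog model: the TransitionSim class, the wire names, and the three-stage carry-save chain are our own stand-ins, with only the carry and sum delays of 2 and 3 taken from the text.

```python
# Toy event-driven transition counter in the spirit of the unit-delay
# experiment described above.  The netlist below (a 3-deep chain of full
# adders, mimicking the linear signal flow of an array) is illustrative,
# not one of the actual multiplier netlists.
import heapq
from itertools import count

CARRY_DELAY, SUM_DELAY = 2, 3   # rough carry/sum delay ratio from the text

class TransitionSim:
    def __init__(self):
        self.val = {}            # wire name -> current logic value
        self.fanout = {}         # wire name -> gates reading it
        self.toggles = {}        # wire name -> number of transitions seen
        self._tie = count()      # tie-breaker for simultaneous events

    def gate(self, ins, out, func, delay):
        g = (ins, out, func, delay)
        for w in ins:
            self.fanout.setdefault(w, []).append(g)
        self.val.setdefault(out, 0)

    def run(self, stimuli):
        """stimuli: iterable of (time, wire, value) primary-input events."""
        q = [(t, next(self._tie), w, v) for t, w, v in stimuli]
        heapq.heapify(q)
        while q:
            t, _, w, v = heapq.heappop(q)
            if self.val.get(w, 0) == v:
                continue                      # no change, no event
            self.val[w] = v
            self.toggles[w] = self.toggles.get(w, 0) + 1
            for ins, out, func, d in self.fanout.get(w, []):
                nv = func(*(self.val.get(i, 0) for i in ins))
                heapq.heappush(q, (t + d, next(self._tie), out, nv))

xor3 = lambda a, b, c: a ^ b ^ c                    # sum of a full adder
maj = lambda a, b, c: (a & b) | (a & c) | (b & c)   # carry of a full adder

sim = TransitionSim()
sim.gate(("a0", "b0", "c0"), "s0", xor3, SUM_DELAY)
sim.gate(("a0", "b0", "c0"), "co0", maj, CARRY_DELAY)
sim.gate(("s0", "a1", "b1"), "s1", xor3, SUM_DELAY)
sim.gate(("s0", "a1", "b1"), "co1", maj, CARRY_DELAY)
sim.gate(("s1", "a2", "b2"), "s2", xor3, SUM_DELAY)
sim.gate(("s1", "a2", "b2"), "co2", maj, CARRY_DELAY)

# Switch every primary input 0 -> 1 at t = 0 and count all transitions.
inputs = ["a0", "b0", "c0", "a1", "b1", "a2", "b2"]
sim.run([(0, w, 1) for w in inputs])
```

Running this, the sum outputs at deeper logic levels toggle strictly more often than those at shallow levels, which is exactly the depth-dependent glitching the text attributes to array-style reduction networks.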
Unfortunately, these analyses are incomplete due to a lack of consideration of wiring
effects. Since a strong point of arrays for both delay and power is the absence of significant
wiring capacitance, the above graph does not really resolve the debate, and therefore is unable
to support or refute the claim of Bellaouar and Elmasry[38].
To remedy this, we devised a more detailed evaluation model that considers not only the
gate-level differences, but also the wiring effects due to layout. As we shall see, experiments
suggest optimism for the Wallace trees: they are neither as bad as [38] suggests, nor as good as
[37] estimates, but within the limits of our model, the tree style appears somewhat better on
both energy consumption and on speed. We will describe our modeling methodology, the lay-
out strategy for comparing arrays and trees, and our experimental results.
3.2 An Improved Analysis Methodology
Determining the effects of interconnect on delay and power requires a more detailed
model of the logical behavior and physical characteristics of our devices. The logic structure
was refined through the use of SPICE models to derive delay and power numbers at the circuit
level. This gives us detailed data which then can be used in gate-level simulation to deter-
mine more exactly transition behavior. The basic analysis methodology was initially devel-
oped for these experiments, and subsequently refined as the research progressed. To that
end, we will only relate the features which are relevant to this work. The comparison meth-
odology comprised several stages.
3.2.1 Generation of Component Library

Logic gates were represented using SPICE transistor descriptions. We worked from fun-
damental blocks such as full-adders, half-adders and AND gates. This allowed us to lever-
age HSPICE and other simulators to develop characterized data for the delay and power
dissipation of our basic blocks. The number of basic CMOS blocks necessary to implement
a multiplier is fairly small, approximately 5-10, depending on how aggressive a design one
wishes to implement. For this initial work, we simulated only the full adder (CSA), half
adder (HA), AND and XOR gates and inverters. The size of the basic logic blocks was ini-
tially fairly coarse; for example, we represented a 3-input, 2-output full adder in a single
block.
An alternative choice would have been to characterize the smallest CMOS gates possi-
ble. For example, the full adder is composed of a carry stage, a sum stage and two inverters.
The choice of doing basic simulation on complex or simple gates has implications for the
accuracy of our delay and power measurements, and will be discussed in the following sec-
tion.
This initial work was done using parameters from the Hewlett-Packard 0.8µm CMOS pro-
cess (CMOS26G), with the designs assuming a 3V power supply.
3.2.2 Circuitry

The basic building block in all of our experiments is the full adder, which is used in the PP
reduction stage and the final adder designs. Static CMOS logic blocks were based on standard
designs, which can be found in [1]. The most commonly used implementation is shown in Fig.
19. The popularity of this particular design comes from the frugal use of transistors in imple-
menting both the carry function and the exclusive-or (sum) function: 28 transistors are used,
Figure 19. Full adder, also called carry-save adder, implemented using the '28T' construction.
hence the designation "28T" cell. Note that the carry signal is faster than the sum signal, which might initially be counter-intuitive from an arithmetic standpoint.

3.2.3 Cell Characterization

In this initial version of the design flow, our cells were characterized through SPICE simulations of major logic blocks such as full-adders, half-adders and AND gates, with output loads of 0 fF and 100 fF. Delay and power were calculated for all possible input changes, in the worst case requiring O(2^(2n)) simulations for complete characterization. In practice, we only simulated input vectors which caused the output to change. Delays were measured from the input 50% point to the output 50% point. Power dissipation was of two forms: that consumed by the circuit switching, and that consumed as charge was delivered and removed from the output load. Data was not calculated for different input slopes. During logic simulation, delay and power were computed as a simple linear interpolation from the data calculated for the two output loads. The actual output load was a combination of gate load (which could be determined from the interconnect network) and wire load (which required the physical design step.)
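The two-point characterization lookup amounts to a simple linear interpolation. A minimal sketch, in which the function name and the delay numbers are illustrative placeholders rather than actual HSPICE data:

```python
def cell_value(char, load_ff):
    """Interpolate a characterized quantity at the actual output load.

    char maps output load (fF) -> characterized value; exactly two
    points, as in this first version of the flow (0 fF and 100 fF).
    """
    (x0, y0), (x1, y1) = sorted(char.items())
    return y0 + (y1 - y0) * (load_ff - x0) / (x1 - x0)

# Placeholder characterization data (illustrative, not HSPICE results):
# sum-output delay in ns at 0 fF and at 100 fF output load.
sum_delay = {0.0: 0.9, 100.0: 2.1}

# Actual load = gate load (from the netlist) + wire load (from layout).
gate_load, wire_load = 22.0, 15.0   # fF, hypothetical values
delay = cell_value(sum_delay, gate_load + wire_load)
```

With these placeholder numbers the 37 fF load interpolates to 1.344 ns; the same lookup is applied to the energy tables.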
Glitching: Simulators generally determine what change of inputs will cause a change of outputs, and execute characterization runs for these input stimuli. Unfortunately, some inputs generate a single glitch (e.g., the output goes 0->1->0), and therefore a simulator can miss the fact that a glitch may appear on the output and be propagated to downstream logic. In fact, a glitch does occur for a particular set of inputs to the full-adder. However, this glitch is fairly small and generally gets filtered out fairly easily.

Accuracy: Characterizing different-sized blocks of logic using SPICE can be desirable or harmful in several ways. If the basic blocks that we characterize are basic CMOS gates (single-output stacks of transistors), we achieve a high level of detail, but we maximize the number of blocks that we have to deal with in a complete, assembled system. On the other hand, if the basic blocks we characterize are larger than CMOS gates, we can lose some detail in circuit behavior. Furthermore, such blocks generally require a larger input count. Since the number of simulations for block characterization is a function of input count (in our case, an exponential function of the input count), we would like our atomic blocks to be as small as possible.

However, there is an advantage in simulating larger blocks. Generally when glitches occur, it is very difficult to determine the energy dissipated, since node voltages often do not swing completely rail-to-rail. In fact, if the output glitches and does not reach the complementary rail before switching back, an event-driven logic simulator may not count this glitch as an 'event'. When simulating large blocks using a detailed circuit simulator, all internal glitches are correctly accounted for.
Due to these and other considerations, we initially decided on characterizing basic blocks such as full adders and MUXes as single entities. Some of these logic blocks are comprised of several independent CMOS gates (e.g., the full-adder is made up of 4 gates.)

Timing: When characterizing cells, we generally simulate for one or more inputs switching at a particular point in time. It is very hard to account for the effect of inputs switching concurrently, but starting at different times (i.e., one input begins switching, then before it is finished, another input begins switching.) To accurately treat this effect, we would need to run a much larger number of characterizations. We settled for running what we considered an empirically "adequate" number of simulations: all possible input vectors which caused an output to switch.

Characterization for a cell library takes approximately 1.5 hours for our basic cells. As we developed more accurate models, we required a greater number of characterization runs, and this time increased.

3.2.4 Partial Product Reduction Generators

Using full-adders and half-adders, we constructed the partial product reduction stage for both arrays and Wallace trees. This assembly method was devised from examining netlists in [1] and was very straightforward; no optimization was attempted in this stage. For arrays, it is possible to incorporate placement information during assembly, since the array structure
implies a regular layout. For Wallace trees, the structure was undetermined until a later placement stage.

3.2.5 Adder Generators

In this stage, we examined three types of adders: the ripple adder, the carry-skip adder and the carry-select adder. These were chosen because they represent a range of delay optimality. It has been established in [37] that there exists a very clear energy-delay trade-off in adder designs, with faster adders consuming more energy per operation. Delay in adders is reduced by incorporating circuitry which performs "lookahead" or prediction of calculations. As a rule of thumb, the more circuitry which is applied to lookahead, the lower the delay. More circuitry also implies more capacitance and therefore more energy dissipation. Our experiments essentially confirmed this view.

Adder generators are also fairly straightforward. Although there are sizing issues involved in some of the adder models, we only implemented simple minimum-sized circuits, again to determine characteristics of minimum power configurations.

3.2.6 Layout Model and Wallace Tree Placement

To evaluate complete multipliers, we require a common layout model for both the regular array and the Wallace tree. It is important here to avoid penalizing one style artificially at the expense of the other. We used a simple unit grid model, in which each component cell of the multiplier netlist occupies one slot in a set of standard cell-style rows (see Fig. 20). We
made all logic blocks unit sized, to facilitate placement. We assume over-the-row routing of
vertical wires connecting cells in different rows, but we conservatively estimate the impact
on total height and wire length of the wires that must make horizontal jogs in each wiring
channel. A post-process global router embeds wires into the placed model, and the maxi-
mum density of each channel is used to derive channel height and final wirelength esti-
mates.
Figure 20. Multiplier layout model, showing input pins, output pins, component rows, wiring channels, and global routes.

For the array, we procedurally tile the regular partial product reduction cells into the grid, with the final reduction adder carefully located at the bottom of the array. Since nearly
all the connections in the array are nearest neighbor between cells in the partial product reduc-
tion network, the array fits well into this model. On the other hand, the Wallace tree requires
constructive placement and routing of its partial product generation (AND gates), reduction
network ((3,2) carry save adder tree) and final reduction adder. We use a simple annealing-
based placement strategy [47] that strives to minimize the overall wirelength while densely
packing the grid.
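An annealing-based placement of this flavor can be sketched as pairwise cell swaps on the unit grid, accepted by a Metropolis rule, with half-perimeter wirelength as the cost. The netlist, cooling schedule, and function names below are illustrative assumptions; the actual placer of [47] is more elaborate.

```python
# Sketch of annealing-based unit-grid placement minimizing half-perimeter
# wirelength (HPWL).  The 16-cell example netlist is a made-up stand-in
# for a Wallace tree netlist.
import math
import random

def hpwl(nets, pos):
    """Total half-perimeter wirelength over all nets."""
    total = 0
    for net in nets:
        xs = [pos[c][0] for c in net]
        ys = [pos[c][1] for c in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(cells, nets, rows, cols, iters=20000, t0=5.0, seed=1):
    rng = random.Random(seed)
    slots = [(r, c) for r in range(rows) for c in range(cols)]
    pos = dict(zip(cells, slots))            # initial arbitrary placement
    cost = hpwl(nets, pos)
    best = cost
    for i in range(iters):
        t = t0 * (1.0 - i / iters) + 1e-6    # simple linear cooling
        a, b = rng.sample(cells, 2)
        pos[a], pos[b] = pos[b], pos[a]      # propose a pairwise swap
        new = hpwl(nets, pos)
        if new <= cost or rng.random() < math.exp((cost - new) / t):
            cost = new
            best = min(best, cost)
        else:
            pos[a], pos[b] = pos[b], pos[a]  # reject: undo the swap
    return pos, best

# Small example: a 15-edge chain plus one 4-pin net, on a 4x4 grid.
cells = [f"g{i}" for i in range(16)]
nets = [(f"g{i}", f"g{i+1}") for i in range(15)] + [("g0", "g5", "g10", "g15")]
pos, best = anneal(cells, nets, rows=4, cols=4)
```

The same HPWL extraction that drives the cost function can afterwards supply the per-net wire loads used in the power and delay estimates.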
Finally, given a complete netlist and a real, although highly simplified placement, we can
extract back per-net capacitance and determine estimates of both power and delay for each
complete multiplier.
3.2.7 Power and Delay Estimation

Initially, we developed Verilog files for the various multiplier architectures, where power
was computed by counting the number of transitions which occurred at each cell output, as
described above. For our more refined design methodology, we developed a simplified, cus-
tom logic simulator for these designs. The simulator is an event-driven evaluation engine,
where drivers send signals to receivers. If the value of a receiver is unresolved when a new
driving signal arrives, this corresponds to a potential glitch, and the previous driving signal
may be preempted. Since the filtering of small glitches from the event queue can cause power
estimation to be inaccurate (i.e., too low), power consumed during glitches is also estimated.
The accuracy of this method was variable, and the inaccuracy grew with the multiplier
size. Although comparisons between our custom logic simulator, operating on characterized
cell-level netlists with back annotated wire delays, against transient device-level HSPICE
simulation on small (4 bit) multipliers showed results within 5% of HSPICE estimates, we
observed that estimates for larger multipliers were more inaccurate. Our later work concen-
trated on improving the accuracy of our estimates.
In practice, what interests us is not the absolute accuracy as much as the relative accu-
racy of estimates between designs. For example, we seemed to consistently overestimate the
energy dissipated in our designs; this overestimation grew as the multiplier size grew. How-
ever, this overestimation was fairly consistent across all types of multipliers at a given size.
The relative difference in power between two multiplier designs at the same size was in most
cases very close to the real relative difference, as calculated by HSPICE. This will be
described in more detail in later chapters.
3.3 Experimental Results
We explored 18 different multiplier implementations in all: two different architectures
(array versus Wallace tree), three different final reduction adders for each multiplier (carry
select, carry skip, carry ripple), and three different word widths (8, 16 and 24 bits).
3.3.1 Layout Characteristics

Table 3 and Table 4 show the area estimates for each multiplier. Recall that wiring impacts
the overall area because of the density estimates for each wiring channel. Unsurprisingly, the
array multipliers are always smaller due to their much more local wiring. Also, the ripple-
adder versions are always the smallest, again due to less adder hardware and more local wir-
ing in these adders. As a point of comparison here, [32] describes a 6 bit array multiplier as
part of an 8-tap FIR filter that occupies roughly 0.4 mm2 in a 0.8µm CMOS process; this sug-
gests that our area estimates are reasonable.
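As a rough cross-check of this comparison, the 6 bit reference can be scaled to 8 bits assuming area grows with the square of the word width. The quadratic law is our assumption (the partial product count is quadratic in the width), not a claim from [32].

```python
# Back-of-the-envelope scaling of the 6 bit array multiplier of [32]
# (roughly 0.4 mm^2 in a 0.8um process) to 8 bits, assuming area scales
# with the square of the word width.  Assumption is ours, for a sanity
# check only.
ref_bits, ref_area = 6, 0.4               # bits, mm^2, from [32]
scaled = ref_area * (8 / ref_bits) ** 2   # ~0.71 mm^2 at 8 bits
```

The scaled figure of about 0.71 mm^2 is the same order as the 0.53-0.67 mm^2 range of the 8 bit array estimates in Table 3, supporting the claim that the estimates are reasonable.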
3.3.2 Energy Per Operation

Fig. 21 shows estimated average energy per multiply operation for each of the multipliers.
Results suggest a fairly consistent 10% energy advantage for the Wallace trees, across the
Table 3. Array multipliers - estimated area (mm2)
Adder Type 8 bit 16 bit 24 bit
Carry Sel. 0.668 2.195 4.579
Carry Skip 0.627 2.099 4.430
Ripple 0.532 1.913 4.156
Table 4. Wallace tree multipliers - estimated area (mm2)
Adder type 8 bit 16 bit 24 bit
Carry Sel. 0.759 2.626 5.625
Carry Skip 0.736 2.537 5.576
Ripple 0.725 2.488 5.576
three bit widths examined. Despite the larger amount of irregular wiring, the shallower par-
tial product reduction in the Wallace tree appears to be advantageous for power.
Figure 21. Estimated average energy per multiply op. (Energy per operation in picojoules versus multiplier size in bits, for array and Wallace tree multipliers with carry select, carry skip, and ripple adders.)

3.3.3 Delay

Fig. 22 shows estimated delay for each multiplier. In general, the delays are fairly long since we use a 0.8µm CMOS process, and because all devices are of minimum size. For the small 8 bit design, results of semi-exhaustive simulation over all pairs of inputs appear in Table 5, along with more conservative static timing estimates for comparison. That is, we simulated all possible inputs (n inputs, 2^n simulations), as simulating all possible input transitions was deemed to be impractical (2^(2n) simulations).
The static timing estimates are pessimistic by 30-50%. The greatest discrepancy occurs
with the carry-skip adder, which has many false paths, due to carry propagate prediction cir-
cuitry. Note that the carry skip adder does not show a speed advantage at low bit width, due to
the lookahead circuitry. Without this circuitry, the carry skip adder behaves like a ripple adder.
Also note that the use of a ripple adder completely negates the advantage of using a Wallace
tree, as expected.
Fig. 22 offers both a pessimistic (static timing) and an optimistic (worst case encountered
during simulation of random patterns) timing estimate for each multiplier. The wide overlap
of the array and Wallace tree timing intervals certainly suggests that the Wallace trees are at
least competitive in delay. Indeed, the intervals for each Wallace tree cover smaller delays
than the corresponding array interval, which also suggests that the trees are faster, again as
expected.
Table 5. 8 bit multiplier - estimated delay (ns)

              Array                       Wallace tree
              exhaustive    static       exhaustive    static
              simulation    timing       simulation    timing
Carry sel.    16.12         22.73        14.86         18.90
Carry skip    18.93         28.93        18.42         27.25
Ripple        18.79         27.94        18.81         26.47
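The pessimism of the static timing estimates can be checked by direct arithmetic on Table 5:

```python
# Static-timing pessimism computed from Table 5 (all values in ns).
table5 = {
    # adder: (array_sim, array_static, wallace_sim, wallace_static)
    "carry_sel":  (16.12, 22.73, 14.86, 18.90),
    "carry_skip": (18.93, 28.93, 18.42, 27.25),
    "ripple":     (18.79, 27.94, 18.81, 26.47),
}

# Percent by which the static estimate exceeds the simulated delay.
pessimism = {}
for adder, (a_sim, a_st, w_sim, w_st) in table5.items():
    pessimism[adder] = (100 * (a_st / a_sim - 1), 100 * (w_st / w_sim - 1))
```

The ratios span roughly 27% to 53%, consistent with the quoted 30-50% range up to rounding, and the carry-skip adder (with its many false paths) indeed shows the largest discrepancy.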
Taken together, Fig. 21 and Fig. 22 suggest that the Wallace trees have an energy-delay advantage over the regular array multipliers. Wallace trees are not so bad as suggested by [38], nor as significantly superior as estimated by [37].
Figure 22. Estimated worst-case multiplier delay (ns). (Delay in nanoseconds for array and Wallace tree multipliers at 8, 16 and 24 bits, showing both simulation estimates and static timing estimates for the carry select, carry skip and ripple adders.)
3.3.4 Further Modelling Refinements

For the purpose of architectural comparison, we developed a simple modelling methodology which allowed rough comparisons of design styles. Our estimates are based on a custom logic simulator, which was necessary due to the large amount of time required to run HSPICE. The layout details were coarse so as to be able to generate and compare large sets of designs. Our goal for subsequent experiments (in Chapters 4 and 5) was to refine the methodology, and to improve the accuracy of simulation results and layout detail.

Simulation values use HSPICE as a 'golden standard'. A main limitation of our custom simulator was that it determined delay and power values using interpolation based only on two data points (delay with output load 0 fF and 100 fF.) We needed to upgrade our simulation environment to be able to interpolate among an arbitrary number of data points (for example, 6 loads, ranging from 0 fF to 60 fF.) Furthermore, we did not initially account for input-slope effects on delay and power. Again, modifications were planned, to allow for interpolation from an arbitrary number of points (we determined that the majority of input slopes range from 100 ps to 2 ns).

A second source of inaccuracy arose in our simple layout model, which did not account for unequal cell sizes. This tended to throw off area estimates, as well as interconnect distance estimates. The error is mitigated by the fact that array and Wallace tree implementations, at a
given size with the same type of final adder, have nearly exactly the same count of full- and half-adders, as well as additional circuitry. Nevertheless, it is quite possible that internal wiring lengths, which are based on the size of these blocks, will be off as a result. In later experiments, we determined that we should estimate cell sizes based on transistor counts, and use these estimates during layout.

Finally, our use of global wiring data to determine wire capacitance lacked detail. A desirable refinement would be to incorporate detailed wiring capacitance numbers. This may well be a requirement in very deep submicron technologies, where wire capacitance may be a dominant factor. Although it is difficult to understand this effect without implementing a full routing-and-extraction methodology, we should verify this result. For our final experiments, we implemented a few designs using the Cadence tool flow, allowing us to extract and simulate complete circuits—these are discussed in the next two chapters.

3.4 Summary

By introducing a simple unit grid layout model, we have been able to compare regular array and Wallace tree style unsigned multipliers over bit widths 8 to 24 bits, including first-order delay and area effects due to physical wiring. The model is clearly coarse, but capable of making basic predictions for area, for average power, and delay. Interestingly, the Wallace trees fare rather well, despite their irregularity and excess wiring. The smaller depth of
their partial product reduction hardware seems to offset the power lost in the wiring, offering
improved energy and delay. We believe this preliminary result justifies closer investigation
with more refined models of the problem, e.g., use of more aggressive exact critical path anal-
ysis[48], and better layout optimization, to determine more accurately the advantages of the
Wallace tree style.
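The interpolation upgrade planned in Section 3.3.4 amounts to piecewise-linear lookup along an arbitrary set of loads, extended bilinearly once input slope becomes a second characterization axis. In this sketch the six loads and the slope range follow the text, while the delay surface and all function names are illustrative placeholders:

```python
# Multi-point characterization lookup: piecewise-linear in output load,
# bilinear once input slope is added as a second axis.  The delay
# surface below is a made-up placeholder, not measured data.
import bisect

def interp1(xs, ys, x):
    """Piecewise-linear interpolation on sorted xs, clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect.bisect_right(xs, x)
    f = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + f * (ys[i] - ys[i - 1])

def interp2(loads, slopes, table, load, slope):
    """Bilinear lookup: interpolate along the load axis at each
    characterized slope, then along the slope axis.
    table[j][i] holds the value at (slopes[j], loads[i])."""
    col = [interp1(loads, row, load) for row in table]
    return interp1(slopes, col, slope)

loads = [0.0, 12.0, 24.0, 36.0, 48.0, 60.0]   # fF: six points, 0-60 fF
slopes = [0.1, 0.5, 1.0, 2.0]                 # ns: 100 ps to 2 ns
# Placeholder delay surface (ns), growing with load and input slope:
table = [[0.5 + 0.02 * l + 0.3 * s for l in loads] for s in slopes]

d = interp2(loads, slopes, table, load=30.0, slope=0.75)
```

Because the placeholder surface is itself linear in both axes, the bilinear lookup recovers it exactly; for real characterized data the scheme gives a piecewise-linear approximation between grid points.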
4 Minimizing Switching Activity By Latch Insertion
4.1 Introduction
The large amount of power dissipation due to false switching in multiplier designs leads
to the question of to what extent one can remove or minimize this false switching by altering
the logic structure. The focus of glitch elimination should be the partial product reduction
tree, as this logic comprises over 50% of the total gate count.
In this chapter, we consider the use of latches inserted into the logic structure to delay
early transitioning signal lines, and attempt to determine their effect on different multiplier
schemes. Our initial interest in the idea of latch insertion was motivated by prior work
which applied this technique to array multipliers[43][45]. We wished to determine whether
latches can be successfully used in Wallace trees. We developed a latch insertion methodology which targeted portions of the multiplier with high switching activity, while attempting to
minimize the power dissipated by the extra latches. Attempting to balance the power saved
versus the extra power dissipated is very difficult. Our results showed that although switching
is reduced through the use of latches, various characteristics of the Wallace tree structure ren-
der the gains minimal when the overhead of the latch reduction circuitry is included.
4.2 False Switching in Multipliers
The root cause of false switching in CMOS logic is input signals arriving at different times, causing the output of a logic gate to transition several times before settling to its final value. We are interested in glitching caused by the following delay effects.
a) input paths of unequal logic depth - in this case, one signal path has to traverse more gates than another signal path before arriving at the input. If all gates have the same delay, the signal on the deeper path will arrive later (Fig. 23a).
b) input paths with high delay gates - if a path consists of gates which have a large delay,
signals traversing this path will be delayed, even if they have the same logic depth as other
paths (Fig. 23b).
c) delay caused by input dependent delay - for example, if a conventional two-input NOR has its inputs switch from 00 to 01, the fall time will be longer than if the inputs had switched from 00 to 11.

d) glitching which is propagated from upstream - if the input to a gate switches several times, the output may also switch several times. This is particularly true if there is significant delay between switch events. Note however, that if the transitions occur fairly close together, or the input is a non-controlling input, the output may not toggle (in effect, a gate "filters out" a glitch.)

4.2.1 Input Latching

In Chapter 3, we described how array multipliers encounter a greater amount of false switching than do Wallace tree multipliers. Whereas all the inputs to adders arrive in parallel, the additions themselves occur in a sequential manner, requiring multiple calculations by the same adder block.
Figure 23. Forms of glitch-inducing delay. (a) Input paths of unequal logic depth. (b) Input paths containing high-delay and low-delay gates.
One approach to alleviating this problem is to make the inputs available only when they
are needed; that is, to delay inputs which arrive earliest, until the latest available input is
present. In this manner, the logic blocks see all inputs change at approximately the same time.
In the array multiplier, it is evident that a great number of partial product bits should be
delayed, since it is not necessary that they all arrive in parallel. Specifically, the PP bits which
feed into later adders should be more delayed. This is the approach suggested by Lemmonds
and Shetti[43] and is depicted in Fig. 24. The delaying of the input bits is performed using a
Figure 24. Using latches to re-time the generation of partial products (Booth). Shown: multiplier inputs, Booth encoders, Booth decoders, and latches, each triggered at different times.
series of latches, which are each timed to delay early arriving input bits, so that their arrival times are synchronized with later arriving inputs. This can be applied to basic PP bit generation, as well as Booth encoded multipliers.

The results from [43] indicated that power in a 16 bit Booth recoded multiplier could be reduced by up to 40%. Note that there are paths that go from the input of the multiplier to the output, but do not encounter a latch. Since we are slowing down the excessively fast paths in this scheme, we should theoretically be able to generate a multiplier which does not have a delay greater than the non-latched version.

This type of latching addresses false switching caused by a) unequal logic depth, but does not deal with b) high delay gates, c) input dependent delay or d) propagated glitches. Insofar as glitches generated by a) are reduced, some effect is felt in reduced d) propagated glitches.

4.2.2 Latching the Signal Path

The previous work equalized the delays at all the primary inputs, then performed the calculation. An alternative which reduces glitching more generally is to insert latches in the signal path at deeper logic levels, to ensure that all sources of unequal input delay are "equalized." In this method, blocks whose delays are equalized have latches present on all inputs. When a new multiplication is started, the latches are initially closed. For each logic
block, the delay of the latest arriving signal is calculated, and each latch is then triggered with a signal that arrives when the latest arriving input is steady. This is described in Fig. 25.

Figure 25. Using latches to equalize signal arrival times in the signal path (Transition Retaining Barriers.)

Several implementation details are relevant. A "row" of latches is inserted at a given depth all across the signal path—not only do latches equalize delay at a given logic block, but outputs from adjacent logic blocks are also synchronized to the given logic block. This means that an equalization effect is seen at later logic blocks.

The latch triggering signal can be implemented by a chain of inverters, whose length is determined by the required delay of the triggering signal. In the case where latches use different triggering signals, this chain of inverters can be used to generate several timing signals. The signal propagates down the inverter chain and is incrementally delayed, so that a latch which is to be triggered at a particular time simply connects to the delay line at the appropriate location. In this manner, power is conserved by using the same timing signal for all latches. In practice, for reasons which will be discussed later, we prefer to trigger latches in parallel. This is feasible for array multipliers, whose regular layout suggests using a bank of latches which are triggered off of the same signal. To achieve this, the triggering signal generated by the chain of inverters is buffered to drive all the latches in parallel.

4.2.3 Previous Work

The above approach was investigated by Musoll and Cortadella in [45]. The latching structure is incorporated into the full-adder circuitry by putting 'enable' transistors into the pull-up and pull-down paths. This type of logic is called Clocked CMOS or C2MOS, of which a description can be found in [1] (see Fig. 27). Using this technique avoids adding an explicit latch (an extra logic stage) but does introduce delay if the devices are not appropriately upsized. The authors define the use of this structure as implementing a transition-retaining barrier (TRB). The methodology for determining the value of this technique is straightforward. Since array multipliers consist of rows of adders which operate at roughly the same time, the objective is to find the row of adders in which to place the TRBs. It is also possible to insert more than one row of TRBs.

The effect of TRBs for array multipliers is shown in Fig. 26. Initially, false switching is a linear function of logic depth. Where TRBs are inserted, false switching events are elimi-
nated for devices at that point in the PP reduction stage. Furthermore, false switching events
which would have been propagated to later stages are also removed. The location of the row
of TRBs is important, since locating them too early or too late in the PP reduction stage
reduces the amount of gain which is achieved.
The conclusion of this research was that it is possible to reduce spurious transitions in a
32-bit multiplier by up to 30%, while incurring an 8% delay penalty. Generally, more TRBs
Figure 26. The false switching in an array is linear in terms of the logic depth (plots of number of transitions versus logic depth). a) For a 16 bit multiplier, the false switching at logic depth 16 is approximately 8 toggles/operation. b) Inserting a TRB saves some of the switching (gray box.) c) The TRB should not be inserted too early or d) too late, as this lessens the amount of false switching that is eliminated.
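The placement argument of Fig. 26 can be reproduced with a toy linear-accumulation model. The growth constant is chosen so that a depth-16 design accumulates about 8 spurious toggles per operation at the deepest level, as in the figure; everything else is an illustrative assumption, not the analysis of [45].

```python
# Toy model of Fig. 26: spurious transitions grow linearly with logic
# depth, and a TRB row resets the accumulation.  C is chosen so depth 16
# shows ~8 toggles/operation; it is illustrative only.
C = 0.5   # spurious toggles accumulated per level of logic depth

def spurious_total(n, trb_row=None):
    """Total spurious transitions over logic depths 1..n, with an
    optional TRB row at depth trb_row resetting glitch accumulation."""
    total = 0.0
    for d in range(1, n + 1):
        depth_since_reset = d if trb_row is None or d <= trb_row else d - trb_row
        total += C * depth_since_reset
    return total

n = 16
base = spurious_total(n)                                     # no TRB
best_row = min(range(1, n), key=lambda r: spurious_total(n, r))
saving = 1 - spurious_total(n, best_row) / base
```

In this symmetric model the single TRB row lands at mid-depth (row 8 of 16), echoing the figure's point that inserting the barrier too early or too late lessens the gain; the predicted saving is optimistic because the model ignores the latch overhead discussed below.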
90 Analysis and Design of Low Power Multipliers
False Switching in Multipliers
rrent
lution
tu-
yield a greater reduction in energy dissipation but also cause greater delay. Furthermore,
larger multipliers have more false switching and therefore more energy reduction can be
achieved with respect to the base case.
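The linear-in-depth behavior sketched in Fig. 26 can be captured by a small illustrative model. This is a sketch only: the per-level toggle rate and the assumption that a TRB row fully resets accumulated glitching are simplifications introduced here, not measured data from this work.

```python
# Toy model of Fig. 26: spurious transitions at a logic level grow with the
# distance to the nearest upstream barrier (the primary inputs or a TRB row).
# The toggle rate per level is an assumed illustrative constant.

def spurious_transitions(depth, trb_rows=(), toggles_per_level=0.5):
    """Total spurious toggles/operation summed over all logic levels."""
    total = 0.0
    last_barrier = 0
    for level in range(1, depth + 1):
        if level in trb_rows:
            last_barrier = level      # TRB resets accumulated glitching
        total += toggles_per_level * (level - last_barrier)
    return total

base = spurious_transitions(16)                  # no TRBs
mid = spurious_transitions(16, trb_rows={8})     # one TRB row mid-array
early = spurious_transitions(16, trb_rows={2})   # inserted too early
assert mid < early < base
```

Under this model a mid-array barrier removes more total glitching than one placed too early or too late, mirroring panels (c) and (d) of the figure.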
4.2.4 General Principles of TRB Insertion
Basic circuitry
There are several methods of incorporating latching behavior in CMOS, e.g., latches,
flip-flops, pass transistors. Our goal is to use latches which are lowest in power, while mini-
mizing delay effects. The use of C2MOS logic has the advantage that no additional logic
stages are introduced. Unfortunately, incorporation of another transistor in the pullup and
pulldown stacks of transistors increases delay due to increased pullup/down resistance. This
can be mitigated by increasing the width of all transistors in the pullup/down paths.
Another problem with latching occurs when the latches are closed for extended periods of time. When the latches are turned off, the logic at the output of the latches is not being driven (there is no conducting path from the gate input to the rails). This condition is known as a “floating” gate, and if the voltage on this gate drifts, this can cause short-circuit current to flow in the driven gate. Note that the latches in [45] do not address this issue. The solution is a pair of inverters in an SRAM configuration, one of which is weak (Fig. 27c). Unfortunately, this increases the power dissipation of the latch.
We investigated alternative designs to incorporate latching behavior. One method was using a simple pass transistor with the back-to-back inverters for holding the state. This has the advantage that we can eliminate a great deal of gate capacitance (capacitance on the input of the latch). Unfortunately, pass transistors are good for passing either a high signal or a low signal, but not both; this causes unequal rising and falling delays. Another technique was to use a pass gate (Fig. 27d); this alleviates the problem of unequal rise/fall time, but uses more power. However, the use of a pass gate is superior to C2MOS in that it has slightly less on-resistance for the same size gates (since two transistors in parallel are in the conducting path). We used the pass gate with back-to-back inverters in our experiments.
Figure 27. Incorporating latching behavior into an inverter. (a) Inverter. (b) C2MOS version of inverter. (c) Incorporating state preservation (one feedback inverter is weaker). (d) Using a pass gate.
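The on-resistance argument can be put in first-order numbers. The per-device resistances below are illustrative assumptions, not values characterized in this work:

```python
# First-order on-resistance comparison of the latching styles in Fig. 27,
# using illustrative per-device resistances (assumed, not measured).
R_N = 10e3   # on-resistance of a reference NMOS device, ohms (assumed)
R_P = 20e3   # on-resistance of the matching PMOS device, ohms (assumed)

# C2MOS: the clocked device sits in series with the logic device, so the
# pull-down path resistance doubles unless the stack is upsized.
r_c2mos_pulldown = R_N + R_N

# Upsizing every transistor in the stack by 2x restores the original path
# resistance (at the cost of extra gate capacitance).
r_c2mos_upsized = (R_N / 2) + (R_N / 2)

# Transmission (pass) gate: NMOS and PMOS conduct in parallel, so for the
# same device sizes the on-resistance is lower than a single device.
r_passgate = (R_N * R_P) / (R_N + R_P)

assert r_c2mos_upsized == R_N
assert r_passgate < R_N < r_c2mos_pulldown
```

The parallel combination is why the pass gate shows slightly lower on-resistance than the C2MOS stack at equal device sizes.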
Latch timing signal
The latches are triggered by a signal which is supplied at the appropriate time. As men-
tioned above, this signal should arrive when the latest arriving input to the adder blocks
transitions. To generate this signal, the triggering time is determined by static delay analysis,
and a chain of delay elements is constructed to replicate this delay time. These delay ele-
ments should be as power efficient as possible.
A simple method of providing a delay is to use back-to-back inverters. A more power-efficient alternative is the current-starved inverter. These have several problems: in the case where we wish to tap the delay line at various points, there is not a fine resolution in delay times to choose from. Secondly, there may be manufacturing variations which cause the delay time to vary, and this can be exacerbated in current-starved inverters. If the latches are triggered too early, the latches will provide less power reduction than desired; if the latches are triggered too late, the multiplier will experience greater delay than expected.
The fact that the delay elements consume power implies that there will be a bias towards placing the latches shallower in the tree (at a lower logic depth), since placing them at a deeper logic depth involves using more delay elements.
Note that having a delay element (inverter) drive several latches can cause the delay ele-
ment to see a very high load capacitance. This will cause the delay of the inverter to be large,
since the rise/falltime of the output is now very long. This effect can be overcome by upsizing
the delay blocks, or for very large loads, having series ratioed inverters [2] or buffers drive the
latches (see Fig. 28). Upsizing inverters adds more gate capacitance, which will increase power. However, strongly driven signals have very sharp rise/fall times, and therefore the short-circuit component of power dissipation is minimized, as described in Chapter 2, Fig. 11.
Figure 28. Triggering of latches. A chain of inverters may be used to generate the delay signal. If all latches are driven in parallel (a), the final signals should be buffered. Otherwise (b), the delay chain can be used unbuffered, assuming the load is not too great.
4.3 Latches as Transition Retaining Barriers in Wallace trees
The analysis by Musoll and Cortadella [45] showed that TRB techniques can be successfully used to reduce power dissipation in array multipliers. However, the more interesting question is whether something similar can be applied to Wallace trees. There are several issues which warrant such an investigation.
Although Wallace trees are a fairly well-established idea, they have recently become more widely used in industry. The main reason is the much improved delay properties of this type of partial product reduction stage, which make it more attractive for high-speed design. Just as important has been the realization that although the irregularity of the layout and wiring requires a much greater design effort than array schemes, the increased availability of optimizing CAD tools has facilitated the design process. Finally, the recognition of the importance of the multiplier (and the consequent willingness to devote much more time to this block) has led to Wallace tree designs becoming more prevalent.
We decided to determine whether one could apply TRBs to Wallace trees. We targeted
Wallace trees with carry-select final adders, as these were the fastest but highest-power mul-
tiplier designs that we investigated (as shown in Chapter 3).
4.3.1 Placement of Latches
In the array designs, the placement of latch elements is fairly straightforward, due to the regular structure of the carry-save adders. Recall that an array structure is laid out in levels,
with a very regular, repeated structure. To a first order, signals travel in a wave down the array
(actually, a series of waves, due to unequal logic depths from the inputs), with adjacent circuit
elements switching at approximately the same point in time. When placing a row of latches in
the array structure, one can trigger all latches in a row at the same time without major effect
on the delay.
The row of latches can be moved up or down in the array, until the optimum point is reached. Furthermore, several rows of latches can be incorporated into the same array. The ideal number of rows of latches can be determined experimentally. The optimum occurs when the overall power is no longer reduced (the overhead of putting more rows of latches starts overwhelming the power reduction achieved through glitch reduction).
Figure 29. Width of signal path: (a) In arrays, the width of the signal path is constant at all logic depths, but in Wallace trees (b), the width of the signal path is greater at shallower logic depths, and the width decreases as the logic depth increases.
4.3.2 Wallace Tree Latch Placement
The placement of latches in a Wallace tree is more complicated than in arrays. This is due to the varying “width” of the signal path at a given logic depth. Recall that arrays have a regular structure consisting of a row of carry-save adders which is repeated n-1 times for an n-bit-width multiplier. For the partial product reduction stage, at a given logic depth, the number of gates in a row of carry-save adders is always n. This means that we can block all signal paths in the PP reduction stage using (n * # of inputs) latches. We can describe arrays as having a “width” of O(n) in the PP reduction stage (Fig. 29a).
Wallace trees, on the other hand, have a width which varies by logic depth. At the first level, the number of signal paths is O(n * log3/2 n). The width of the signal path decreases until we have two bit vectors to add; the final width is approximately 2n - log3/2 n (Fig. 29b).
We can clearly see that there is a bias towards placing a row of latches deeper in the tree, since at each level, the number of blocks in a row decreases by log3/2 n. (This is in contrast to the bias towards having latches at a shallower logic depth, to minimize the number of elements in the inverter chain.)
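The shrinking width can be made concrete with the standard 3:2 row-reduction recurrence for Wallace trees (a sketch introduced here for illustration, not code from this work): each level compresses groups of three partial-product rows into two using carry-save adders.

```python
# Rows of partial products remaining after each Wallace reduction level:
# groups of three rows are compressed to two by 3:2 counters (carry-save
# adders); leftover rows pass through. Iterate until two rows remain.

def wallace_levels(rows):
    """Return the row count at each level, from the inputs down to 2."""
    widths = [rows]
    while rows > 2:
        full_adders, leftover = divmod(rows, 3)
        rows = 2 * full_adders + leftover
        widths.append(rows)
    return widths

print(wallace_levels(16))   # [16, 11, 8, 6, 4, 3, 2]
```

For 16 rows, six reduction levels are needed, consistent with the log3/2 n depth cited above, and the width visibly shrinks toward the final adder.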
4.3.3 Latch Insertion Methodology
Our initial design flow for latch insertion relied upon hand-tuning to optimize results. The
placement of the latches was determined procedurally. Refinement of latch placement was
performed by hand. The connection of signals from the inverter delay chain to generate timing
of the latches was also arranged by hand. The procedure is as follows (Fig. 30):
• Create a chain of inverters.
• Determine placement of latches.
• Set timing of latches.
• Trim inverter chain.
Figure 30. Procedure for latch insertion: 1) create a chain of inverters, 2) determine placement of latches, 3) set timing of latches, 4) trim inverter chain.
Create a chain of inverters
Initially, we need to create a chain of inverters to provide a series of trigger points (points in time) for latches. First, we perform a static timing analysis of the multiplier. This gives us a maximum delay through the multiplier. We then create a primary input which will provide the initial timing signal for the latches. To this input, we add a delay block (two back-to-back inverters), and calculate its delay. If this delay is less than the delay of the multiplier, we add a new delay block to the output of this delay block. We repeat this process until a chain of inverters has been created whose overall delay is equal to or greater than the delay of the multiplier; this means we are guaranteed to be able to trigger latches which are placed anywhere in the multiplier.
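The construction loop above can be sketched as follows. The per-block delay value is an illustrative assumption; in the flow it comes from circuit-level analysis of the inverter pair.

```python
# Sketch of the delay-chain construction step: append delay blocks (pairs
# of back-to-back inverters) until the chain's cumulative delay covers the
# multiplier's maximum delay from static timing analysis.

def build_delay_chain(multiplier_delay_ns, block_delay_ns=0.4):
    """Return the tap times (ns) of a chain long enough for any latch."""
    taps = []
    total = 0.0
    while total < multiplier_delay_ns:
        total += block_delay_ns          # add one inverter pair
        taps.append(total)
    return taps                          # last tap >= multiplier delay

taps = build_delay_chain(5.0)
assert taps[-1] >= 5.0
```

Every intermediate tap is available as a candidate trigger point, which is what makes the later timing-assignment step possible.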
Determine placement of latches
Good locations for latch placement are determined, and latches are inserted. Good locations are points which yield a maximum of power reduction using the fewest latches. These points are determined empirically through hand experiments. Two locations initially suggest themselves for latch insertion: at the Wallace tree/final adder boundary, and inside the Wallace tree (Fig. 31).
The Wallace tree/final adder boundary is a good candidate for two reasons: 1) a row of
TRBs can be placed here using a small number of latches because the "width" of the multi-
plier is smaller and 2) the adder is at a deep logic depth, and is experiencing a great amount of
false switching. A disadvantage is that since this is at a deep logic depth, a long inverter chain
is needed to generate timing for the latches.
The second location, inside the Wallace tree, is potentially rewarding because the effects
of the TRBs can be felt in downstream logic. Therefore, switching activity can be reduced in
the Wallace tree as well as the adder. Since these are at a shallower logic depth, fewer invert-
ers will be needed. However, the "width" of the multiplier at shallower logic depths is much
greater.
These points form the basis of the experiments described in the next section.
Figure 31. Potential latch insertion sites: a) at the Wallace tree/final adder boundary and b) in the Wallace tree.
Set timing of latches
Prior to latch insertion, a static timing analysis has been performed on all inputs to logic
blocks in the multiplier. These times serve as a basis for determining when latches should be
triggered. For example, for inputs A, B, and C, if signals arrive at times A = 5 ns, B = 7 ns, and C = 9 ns, latches should be placed on A and B, and they should be triggered at time 9 ns.
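The rule in this example can be sketched directly (hypothetical signal names, with the arrival times used in the text):

```python
# Sketch of the trigger-time rule: latch every input that arrives before
# the latest one, and trigger those latches when the latest input becomes
# valid. Arrival times come from static timing analysis.

def plan_latches(arrival_ns):
    """Return (inputs to latch, trigger time) for one adder block."""
    trigger = max(arrival_ns.values())
    latched = [sig for sig, t in arrival_ns.items() if t < trigger]
    return latched, trigger

latched, trigger = plan_latches({"A": 5.0, "B": 7.0, "C": 9.0})
assert latched == ["A", "B"] and trigger == 9.0
```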
Timing is determined on an incremental basis: latches are ordered by the time they are to be triggered, earliest to latest. First, the earliest-triggered latches are connected to the appropriate point on the delay line (chain of inverters). As mentioned earlier, connecting latch inputs to an inverter output increases the delay of the inverter. Therefore, after connecting latches to an inverter, the timing of the entire delay line is recalculated. Then the next set of latches is connected to the delay line, and the process iterates until all latches have been connected.
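A sketch of this incremental loop follows, with an assumed linear load/delay model (the base delay and per-latch penalty are illustrative constants; the real flow recomputes delays with circuit-level timing analysis):

```python
# Sketch of incremental timing: latches are processed earliest first; each
# connection loads its inverter stage, so all downstream tap times are
# recomputed before the next latch is placed.

BASE_DELAY = 0.4        # ns per delay block, unloaded (assumed)
DELAY_PER_LATCH = 0.05  # ns of extra stage delay per attached latch (assumed)

def connect_latches(trigger_times, chain_len=40):
    loads = [0] * chain_len            # latches attached to each stage
    assignments = {}
    for latch, t_need in sorted(trigger_times.items(), key=lambda kv: kv[1]):
        # Recompute tap times under current loading, then pick the first
        # tap that is no earlier than the required trigger time.
        tap = 0.0
        for stage in range(chain_len):
            tap += BASE_DELAY + loads[stage] * DELAY_PER_LATCH
            if tap >= t_need:
                loads[stage] += 1
                assignments[latch] = stage
                break
    return assignments, loads

asg, loads = connect_latches({"L0": 1.0, "L1": 1.0, "L2": 3.0})
```

Note how the second latch needing 1.0 ns loads the same stage as the first, which in turn pushes every later tap time outward before the third latch is placed.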
Trim inverter chain
Generally, loading the delay line causes the inverter chain delay to increase. The final inverters’ delay often exceeds the initial multiplier maximum delay, so extra inverters are trimmed from the end of the chain.
4.3.4 Placement of Latches on the Wallace Tree/Final Adder Boundary
Since Wallace tree PP reduction stages have a “width” which is narrowest just before the final adder, we decided to investigate putting the latches between the partial product reduction stage and the adder. This would minimize the number of latches which we need to insert. Quite a bit of false switching occurs in the final adder, and the power is worse for adders which have highly parallel carry-lookahead schemes.
If a row of latches is placed between the PP reduction stage and the adder, the timing of
the latches can be optimized to take into account the carry-in ripple effect of the adder. If the
timing of each latch at a higher bit position is successively delayed, the inputs to each block of
the adder can be made to arrive at the same time (roughly) as the carry-in. (Timing is not
exact, due to input-dependent delays.) Inserting the delay blocks is simply an extension of the
delay blocks necessary to delay the trigger signal. Note that these delay blocks will also con-
sume power. We call this “cascading” the trigger signal.
When driving a final adder using the cascaded delay elements pattern, we cannot drive all latches from a single buffer, since the timing of each latch is different. However, if all the latches are driven by different timing signals (points on the inverter chain), we may not need to buffer the timing signals driving the latches, since one timing signal does not see many loads (as explained in Fig. 28). The question was to determine whether timing signals see heavy loads.
For final adders which are of the “ripple” variety, the minimum delay of each block is determined by the carry. Each full-adder block’s outputs are successively delayed by at least one gate level due to the ripple effect. This means that if the other inputs to the full-adder block are to be latched, they should be triggered by a signal which is successively delayed by one adder block (Fig. 32).
Figure 32. "Cascade" triggering style - in this method, the triggering signals of latches on the inputs of an adder are incrementally delayed, so that the inputs are delayed to compensate for the delay of the carry.
Consider the final adder implemented by a carry-select scheme. The delay properties are shown in Chapter 2, Fig. 9. Recall that this kind of final adder consists of a series of ripple adders operating in parallel. Since the ripple adders operate in parallel, the inputs of several ripple adders often are added at the same time, and therefore latches placed on these inputs should be triggered at the same time. Therefore, the condition exists where the delay signals drive several latches in parallel and see a heavy load.
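The cascade timing itself is a simple arithmetic schedule (a sketch; delays are illustrative values in picoseconds, not characterized block delays):

```python
# Sketch of "cascade" triggering (Fig. 32): the trigger for the latch at
# bit i is delayed by roughly one adder-block delay per bit, so latched
# inputs become valid about when the rippling carry reaches that block.

def cascade_trigger_times(n_bits, base_trigger_ps, block_delay_ps=300):
    return [base_trigger_ps + i * block_delay_ps for i in range(n_bits)]

times = cascade_trigger_times(4, 2000)   # picoseconds
assert times == [2000, 2300, 2600, 2900]
```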
Faced with this condition, we can either 1) use buffers to drive the latch timing signals,
and maintain (roughly) load-invariant timing characteristics of the delay chain, or 2) allow the
timing of the delay chain to be affected by the loads, and insert/remove some delay elements
to get the proper timing. We believed that 2) was more straightforward to implement and was
also lower in power, so we chose this method.
The potential for power savings comes from the false switching present in the final adder.
There is a great deal of false switching due to the successive delay of the carry signal through
the adder. The other inputs to the adder arrive at more uniform delay times, so at high bit positions, the inputs to the adder blocks arrive much earlier than the carry. Furthermore, in very fast adders such as the carry select, there are a large number of blocks in the adder, and the total amount of capacitance switched by skewed inputs is greater than in simple ripple adders.
Results for this analysis are described in section 4.4.
4.3.5 Placing Latches in Wallace Trees
Just like in array schemes, the greatest amount of false switching occurs in the later stages
of the Wallace tree. In [45], the authors found that the power reduction as a function of latch placement (in terms of logic depth) was a concave function with a minimum halfway down the array (see Fig. 26). These results suggest that we should try to move the latches up from the adder inputs, into the PP reduction stage itself.
Latch placement for transition reduction is based on levels. Elements are ordered by
their minimum and maximum logic depths. One then decides to place latches at "2 elements
down from the inputs" or "3 elements up from the outputs." For example, placing latches "0
elements up from the outputs" means placing latches at the outputs (see Fig. 33).
We developed a simple procedure which calculates the minimum and maximum logic
depths of each component. We empirically tried placement of latches at a logic depth "up
from the bottom" by various levels. We then attempted to generate the timing signals for
these latches.
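The min/max logic-depth bookkeeping can be sketched as a recursive pass over a netlist graph. The tiny netlist below is hypothetical, constructed only to reproduce the situation of Fig. 33b:

```python
# Sketch of the logic-depth calculation used for latch placement: for every
# component, record the minimum and maximum number of logic levels from the
# primary inputs (primary inputs themselves count as depth 1, as in Fig. 33).

def logic_depths(fanins):
    """fanins: gate -> list of driving gates ([] for primary inputs)."""
    depth = {}
    def visit(g):
        if g not in depth:
            if not fanins[g]:
                depth[g] = (1, 1)
            else:
                lo = min(visit(d)[0] for d in fanins[g]) + 1
                hi = max(visit(d)[1] for d in fanins[g]) + 1
                depth[g] = (lo, hi)
        return depth[g]
    for g in fanins:
        visit(g)
    return depth

# A gate reading both a primary input and a deeper gate (a signal that
# 'skips' a level) gets unequal min/max depth, as in Fig. 33b.
net = {"g1": [], "g2": ["g1"], "g3": ["g1", "g2"]}
assert logic_depths(net)["g3"] == (2, 3)
```

Gates with unequal min/max depth are exactly the cases that make "levels" ill-defined in Wallace trees, as discussed next.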
Figure 33. Placing latches by logic depth. a), b) calculating logic depth, c) placing latches "one up from the bottom."
Placement of latches
A difficulty which we encounter when placing latches in the PP reduction stage is that the notion of “levels of logic” which is present in array schemes is ill-defined for Wallace trees. Arrays have the property that for every component in the array, "minimum logic depth" = "maximum logic depth", as in Fig. 33a. Wallace trees have cases like Fig. 33b, where the minimum and maximum logic depth can be quite different. For example, a section of a 16-bit Wallace tree multiplier is shown in Fig. 34. If we attempt to place latches at one level of logic "up" from the final adder, latches are placed as shown in Fig. 34b.
Figure 34. Placement of latches in Wallace tree. a) Original structure, and b) placement of latches, "one level" up from the PP reduction/adder interface.
We can note two effects from placing the latches at this position. First, the number of latches increases tremendously, as expected, due to the greater width of the tree at low logic depth and the lesser width of the tree at higher logic depth, near the final adder.
Most problematic is the lack of clearly defined “levels”. Since we have some signals which ‘skip’ logic, as described in Chapter 3, Fig. 17, we encounter several paths where there is more than a single latch blocking the signal. We must therefore remove redundant latches and generate appropriate timing for the remaining latch elements. This removal of latches is an ad-hoc process, in which we attempted to minimize the number of latches while maintaining the transition-retaining barrier effect.
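The ad-hoc cleanup can be approximated by a greedy sketch (the path lists and latch names below are hypothetical; the actual removal in this work was done by hand):

```python
# Greedy redundant-latch pruning: drop any latch whose removal still leaves
# every input-to-output path blocked by at least one remaining latch, so
# the transition-retaining barrier stays intact.

def prune_latches(paths, latches):
    kept = set(latches)
    for cand in sorted(latches):
        trial = kept - {cand}
        if all(any(node in trial for node in path) for path in paths):
            kept = trial        # cand was redundant; barrier still intact
    return kept

# A reconvergent path crosses both L1 and L2, so L1 can be dropped.
paths = [["a", "L1", "L2", "x"], ["b", "L2", "y"]]
assert prune_latches(paths, {"L1", "L2"}) == {"L2"}
```

A greedy pass like this minimizes latch count only heuristically, which matches the text's characterization of the process as ad hoc.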
Timing issues
Inserting latches in the PP reduction stage creates further difficulties for timing. First, note that we have more latches to trigger. This will cause more loading on the delay elements which drive the latches, and will create a very complicated timing generation problem.
Secondly, since the latches are present at a shallower logic depth, the times for triggering the various latches are closer together (in time). Therefore, it is more likely that a delay element will have to drive more latches. Again, this causes load-dependent delay variations in our timing chain.
Thirdly, we must now consider timing to reduce glitching in the partial product reduction stage as well as in the adder. For carry-select blocks, timing requirements of the adder conflict with the timing requirements of the partial product reduction stage. Adders desire a “cascading” timing pattern, so that their inputs are made valid when the carry becomes valid. PP reduction stages want their inputs to be made valid when their neighbors’ inputs are valid. It is not clear which of these two effects should have precedence over the other. We attempted to generate timing for both these cases, to determine whether one dominates over the other.
4.4 Experiments
In this section, we describe results for several experiments using latch insertion in Wallace trees. As we mentioned in the introduction, it turns out that the power reductions achieved in the Wallace tree were quite small when compared to the cost of adding latches and timing signals. We will first describe our experiments and then offer some analysis.
We initially ran latch insertion experiments on 8-bit multipliers, but power dissipation results were consistently 20% worse than the base multiplier without any latches. Since false switching tends to increase as multiplier size grows, we ran all other experiments at 16 bits. For comparison, Musoll and Cortadella [45] achieved 5% worse power for TRBs at 8 bits, but latches yielded a 7% power improvement over the base multiplier at 16 bits. Note also that
[45] includes physical design characteristics, while we have not incorporated these effects,
which would tend to make latched multiplier power dissipation worse.
4.4.1 Experiment: Latch Placement on the Wallace Tree/Final Adder Boundary
Since we are looking to trigger latches at the boundary of the PP reduction stage and the final adder, the timing signals were connected so as to generate a “cascade” effect for the latches on the final adder, as mentioned above. Latches were placed all across the final adder, as shown in Fig. 35. In all, 54 latches were placed, along with 30 delay elements (60 back-to-back inverters).
Figure 35. Placement of latches on the Wallace tree/final adder boundary (16-bit multiplicand and multiplier, 256 partial product generators).
We connected the timing signals to the latches by hand and attempted to minimize power dissipation by varying the timing of the latches to minimize false switching in the adder. In these experiments, we used latches without the back-to-back inverters for “state retention”, to minimize latch power. We were able to reduce the power dissipated in the adder, but the power dissipated in the delay chain and the latches was in all cases greater than the savings. For a 16-bit Wallace tree with carry-select adder, the lowest energy design that we arrived at was 1% worse than the base case. Furthermore, the delay caused by latch insertion was 5-10% worse than the base case.
We investigated several other alternatives. A characteristic of arithmetic CMOS devices is that false switching is worse at higher order bit positions than at lower order bit positions. We tried removing latches at the lower order bit positions, since they are not removing a great deal of false switching. The result was increased power dissipation (around 5% worse than the base case). Removing latches at lower bit positions removes capacitance, but one still needs to keep the entire delay chain, as the high order bit positions receive timing signals that are highly delayed.
In conclusion, it appears that although adder switching reductions can be obtained by inserting latches at the Wallace tree/final adder boundary, the power reductions are minimal when compared to the cost of inserting extra devices. We next investigated putting
the latches inside the Wallace tree itself.
4.4.2 Experiment: Placing Latches in the Wallace Tree
We attempted to place latches in various configurations, to reduce glitching while minimizing the number of extra latches involved. Moving the latches to various parts of the PP reduction tree required a complete resynthesis of the timing chain, to re-establish correct delay driving.
In the first case, we placed latches at "one level up" from the Wallace tree/final adder
boundary. Since this places redundant latches in the signal path, as described in Fig. 34, we
Figure 36. Placement of latches within the Wallace tree.
removed some of these latches by hand. Timing was also performed by hand, and several timing configurations were tried. Following removal of redundant latches, we ended up with 93 latches and 20 delay elements.
In this experiment, we were never able to achieve a power reduction with respect to the base Wallace tree multiplier without latches. In the best case, our design’s power dissipation was 10% worse than the base multiplier design, again with approximately a 10% delay penalty.
Similar to the analysis of [45], we tried placing latches at higher levels up in the Wallace tree. In all cases, power dissipation was over 15% worse than the base case. We did not observe that reduction of false switching at one "level" causes a great reduction of false switching at subsequent depths of logic, as was the case in arrays.
As in the previous experiments, we attempted to remove latches at low order bit positions within the Wallace tree. Again, this did not improve the power dissipation characteristics of this design.
4.4.3 Conclusions
Our best result achieved was to nearly break even when inserting latches at the boundary of the Wallace tree/final adder, for a 16-bit multiplier. That is, the power consumed by the extra circuitry (latches and delay chain inverters) was nearly offset by the power removed
through the reduction of false switching in the array. Furthermore, note that this did not
include layout details (e.g., interconnect capacitance), which means that real implementa-
tions of latched multipliers probably have even worse power dissipation characteristics than
those determined by the above experiments.
Latch insertion for Wallace trees does not seem to generate power savings. Although success has been obtained when using latches to eliminate false switching in arrays, a similar result was not achievable in Wallace trees. We believe that this is due to differences in the structure of the two types of design.
First, the Wallace tree has a very wide signal path. This width, which is constant in array
designs, starts off very wide at low logic depth but decreases logarithmically until the final
adder is reached. Even at the PP reduction/adder interface, this width is greater than the
width of the array. This means that more latches are necessary to create a transition retaining
barrier. If the designer wishes to create a TRB in the PP reduction stage, a very large number
of latches is needed.
Placement of the latches is unclear, due to the lack of clearly defined “levels” as exist in arrays. Since many paths reconverge in multipliers, and the logic depths of these paths are not equal, it is possible to have a path go through several latches. In this case, the timing of the latches is complicated, and since each latch adds delay to the signal path, the total delay
may become very large.
Timing of the latches using a delay chain is straightforward in array designs, since the
latches are triggered at the same time. Wallace tree designs require latches to be timed at dif-
ferent intervals, and this causes extra load on the delay chain, which impacts the timing itself.
Furthermore, the extra number of latches present in these designs means the overall loads can
be quite large. Overcoming the load-induced delay effects may require buffering, which will
adversely impact power.
4.5 Conclusions
In this work, we had originally intended to explore by hand various ‘corners’ in the design space of latch insertion for power reduction, and then to use the initial empirical designs to decide where to focus a more general automated latch insertion methodology. Unfortunately, none of the options we tried—latch insertion at the Wallace tree/final adder boundary, latch insertion at shallower logic depth in the Wallace tree—yielded any encouraging power reductions. Therefore, we abandoned the idea of developing an automated latch-based power optimizing system. A basic result of these investigations is that although power reductions are achieved through latch insertion, the cost of adding these extra elements to Wallace trees is too expensive when compared to the small amount of switching that is eliminated. This suggests that removal of circuitry may be the best approach to power reduction in Wallace trees; this is the subject of the following chapter.
It is unclear whether there are any circumstances under which Wallace trees lend themselves to latch-based power reduction techniques. For arrays, it can be argued that false switching is so bad and the amount of power to be reduced is so great that many techniques stand a good chance of achieving gains in these designs. The main reason for array inefficiency is the great discrepancy in logic depths that signal paths traverse before arriving at the same logic block. This is not a major concern in Wallace trees.
Another, more subtle advantage inherent in Wallace trees comes from their relatively shallow logic depth. Unequal delays occur in every logic circuit for reasons other than logic depth, e.g., high-delay gates and input-dependent delay. Generally, for deeper logic depths, the cumulative effect of these delays is magnified; since Wallace trees have shallower logic depth, these effects are not as important as they are in arrays.
Based on the above characteristics, we believe that latch insertion is not a viable power reduction technique in Wallace trees. In the previous chapter, we showed that Wallace trees are superior in terms of power dissipation in comparison to arrays. Latches achieve power reduction by lessening false switching; it is therefore not surprising that they are more effective in array designs. Although false switching is present in Wallace trees, it is the activity*Capacitance product which is problematic, as Wallace trees have long interconnect lines which have high capacitance. Therefore, schemes which reduce capacitance in Wallace trees should be more effective in reducing power.
5 Minimizing Power Via Inverse Polarity Optimization
5.1 Introduction
Latch insertion fails to yield power savings when applied to Wallace trees because
although a great deal of false switching occurs in Wallace trees, this false switching is a
result of several different effects (unequal logic depth, input-dependent delays, etc.). There is no single effect which generates a large amount of false switching (as successive logic-depth skew does in arrays), nor is there one problematic sub-block which makes an easy target for optimization. In Wallace trees, a more useful approach might be to make a small modification
which is pervasive throughout the tree; although false switching cannot be eliminated, its
effects may be reduced.
In this section, we investigate the possibility of removing some circuitry from the basic
adder blocks of the multiplier, while maintaining the adder functionality. Ultimately, this is
the goal of logic synthesis: to use the minimum gate count necessary to implement a function.
Such a goal works towards minimizing the area required for the design, which in turn has ben-
eficial effects at the physical level (smaller die area, shorter interconnect, etc.) We consider a
technique which has been developed for delay optimization of adders, and we try to adapt it
for power optimization of multipliers.
5.2 Polarity Inversion
The technique which we term “inverse polarity” attempts to remove unnecessary inverters from the Wallace tree during partial product reduction. It has previously been applied to arithmetic logic as a delay reduction technique [25]. We illustrate the idea with a simple example. Inverse polarity is applied to ripple adders in the following manner: two numbers, A = [a0 a1 ... an-1] and B = [b0 b1 ... bn-1], are added to form C = [c0 c1 ... cn-1]. Each carry-save adder performs two functions: calculating the sum and calculating the carry. The carry sub-block of the full adder consists of two logic stages: the inverted-carry (carry′) stage and the inverter stage (Fig. 37a). For the ripple adder as a whole, the delay in terms of the number of logic stages is 2n. The inverse polarity technique attempts to remove the inverter stage by complementing the input bits at every other bit position. In this manner, an adder of the form in Fig.
37b results, and the delay in terms of logic stages is n. Since the inverter has been removed, there may be a loss in drive strength when driving the input loads of the next stage, and these gates may need to be upsized. However, since the logic depth has been cut
in half, the overall delay may be significantly reduced.
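The correctness of the alternating-complement trick rests on the self-duality of the majority (carry) function: maj(a′, b′, c′) = maj(a, b, c)′. The following sketch is our own illustration, not code from the dissertation; it models the carry chain only and checks the construction exhaustively for small operands:

```python
from itertools import product

def maj(a, b, c):
    # carry function of a full adder: majority of the three inputs
    return (a & b) | (a & c) | (b & c)

def inv_carry(a, b, c):
    # inverted-carry stage of Fig. 37: the carry gate WITHOUT its output inverter
    return 1 - maj(a, b, c)

def carry_out_conventional(A, B, n):
    c = 0
    for i in range(n):
        c = maj((A >> i) & 1, (B >> i) & 1, c)   # inverted-carry stage + inverter
    return c

def carry_out_inverse_polarity(A, B, n):
    c = 0   # true (POS) carry into bit 0
    for i in range(n):
        a, b = (A >> i) & 1, (B >> i) & 1
        if i % 2 == 0:
            c = inv_carry(a, b, c)            # POS inputs -> NEG carry wire
        else:
            c = inv_carry(1 - a, 1 - b, c)    # complemented inputs -> POS carry wire
    return c if n % 2 == 0 else 1 - c         # undo final wire polarity if NEG

# exhaustive check: the two carry chains agree for all 4-bit operands
for A, B in product(range(16), repeat=2):
    assert carry_out_conventional(A, B, 4) == carry_out_inverse_polarity(A, B, 4)
print("inverter-free alternating-polarity carry chain verified")
```

Because each odd-indexed stage receives a complemented carry and complemented operand bits, self-duality restores the true carry value, so no stage needs an output inverter.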
Given that this removal of inverters does not change the logic function, we may use this
new configuration to apply another optimization. If devices retain the same size, a power
reduction may be obtained simply because the number of transistors implementing the adder
has been decreased. Note that in the example of Fig. 37, the actual number of inverters has not been reduced, due to the need to invert the input values at every other stage. However, in practical cases, it may not be necessary to invert these values, and the
transistor count may then go down. If the gates are not upsized, it is not clear a priori what
will happen to the delay. Depending on the gate driving strength and associated capacitive
load, it may go up or down.
In multipliers, the majority of CMOS gates are found in the partial product (PP) reduction
stage. Closer analysis of the PP reduction network shows that the operation resembles a two-
dimensional ripple addition, with carries propagating to higher order bit positions, and sums
remaining at the same bit order (Fig. 37c). This insight leads to the idea of applying the inverse
polarity technique to the PP reduction stage.
In the following sections, we will show how inverse polarity circuits can be used in Wallace tree multipliers to lower power dissipation. We describe inverse polarity circuitry and develop a heuristic for implementing a Wallace tree PP reduction circuit using polarity inversion. We examine the effects of the inverse polarity optimization on delay. Additional effects, such as area and parasitic capacitance reduction, are also investigated. In this discussion, we denote the carry design with output inverter (Fig. 37a) as the ’conventional’ implementation.
Figure 37. Inverse polarity optimization: (a) conventional ripple adder; (b) inverted polarity version; (c) multiplier PPA structure (array).
5.3 Design Issues for Polarity Inversion
Digital CMOS logic design focuses on implementing a logic function while minimizing transistor count, subject to delay and power constraints. To this end, many CMOS logic gates are designed to drive large interconnect lines by using a two-stage structure: the first stage implements the logic, and the second stage consists of a buffer (i.e., an inverter) to drive the output capacitance. In this manner, the transistors implementing the logic function may be kept small while the buffer transistors can be made large, resulting in lower total transistor size.
While the logic stage/buffer structure provides strong drive for large loads, multiplier circuits have the interesting characteristic that large net capacitances are not commonly encountered. With a few exceptions, most connections are 2-point nets, and if proper placement can ensure that connected components are located fairly close together, resulting in short wires, the effects of low drive on delay can be minimized. Therefore, other than inside the partial product generating cells and in some of the final adder cells, the advantage of a
buffer structure is minimal. This suggests that removal of output inverters may realize lower
power dissipation with minor effects on delay. If a logically equivalent implementation using
fewer buffers can be assembled, the number of switching transistors will be reduced and there
should be a corresponding decrease in power dissipation. We apply this technique to Wallace
tree multipliers to determine resultant power savings.
5.3.1 Adder Circuit Designs
The fundamental building block in digital multipliers is the full adder, used in the partial product reduction phase to perform carry-save addition (it is therefore sometimes called a carry-save adder, or CSA). The CSA takes three inputs and calculates two outputs, sum and carry. The most commonly used implementation (modified from [1]) is shown again in Fig. 38. As mentioned previously, this implementation has 28 transistors, hence the designation “28T”. Note that this circuit incorporates the logic stage/buffer structure which is beneficial when driving large output capacitances.
The inverse polarity paradigm identifies bits as one of two polarities—that is, bits will represent the results of additions, i.e., sum and carry (positive polarity—POS), or their complements sum′ and carry′ (negative polarity—NEG). Inverted polarity circuits require that an
adder with POS inputs provide NEG outputs and vice-versa. Therefore:
CSA_IP(a,b,c) = (sum′, carry′)
CSA_IP(a′,b′,c′) = (sum, carry).
Here, the ’IP’ subscript denotes inverse polarity cells. The 28T implementation of the CSA can be transformed into a CSA_IP by simple removal of the inverters—this is because a complement of the circuit inputs to a CSA yields sum′ and carry′. The half adder (HA), on the other hand, cannot be so easily constructed (see Fig. 39). For half adders, again, we want:
HA_IP(a,b) = (sum′, carry′)
HA_IP(a′,b′) = (sum, carry).
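These identities are easy to confirm by brute force. The sketch below is our own illustration (not from the dissertation); it checks that the full adder is self-dual, so complementing its inputs complements both outputs, while the half adder is not:

```python
from itertools import product

def csa(a, b, c):
    # full adder: (sum, carry)
    return (a ^ b ^ c, (a & b) | (a & c) | (b & c))

def ha(a, b):
    # half adder: (sum, carry)
    return (a ^ b, a & b)

# Full adder: CSA(a', b', c') == (sum', carry') for every input combination,
# which is why stripping the 28T cell's output inverters yields a valid CSA_IP.
for a, b, c in product((0, 1), repeat=3):
    s, cy = csa(a, b, c)
    assert csa(1 - a, 1 - b, 1 - c) == (1 - s, 1 - cy)

# Half adder: the same identity fails, e.g. for a = b = 0, so dedicated
# HA_IP cells (Fig. 40) are needed.
s, cy = ha(0, 0)                       # (0, 0)
assert ha(1, 1) != (1 - s, 1 - cy)     # ha(1, 1) is (0, 1), not (1, 1)
print("CSA is self-dual; HA is not")
```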
Figure 38. The 28T full adder (CSA) implementation.
If the inputs to an HA are complemented, the resulting carry signal is not the complement of the original carry signal (see Fig. 39). Therefore, two versions of the HA_IP are required—one for POS inputs which yields sum′ and carry′, and another for NEG inputs that gives sum and carry. These are implemented using the circuits shown in Fig. 40.
Figure 39. Conventional implementation vs. inverse polarity equivalence: (a) for full adders, CSA(a′,b′,c′) = (sum′, carry′); (b) but for half adders, HA(a′,b′) ≠ (sum′, carry′).
Therefore, although the CSA can be easily transformed into an inverse polarity version by simply removing the output inverters, the HA cells need to be redesigned when implementing some inverse polarity versions.
5.3.2 Partial Product Reduction
A greedy heuristic was proposed by Fadavi-Ardekani in [33] to construct a Wallace-style
partial product reduction tree while minimizing logic depth. In this method, a priority queue
stores the bits for each column of the trapezoidal PP array, ordered by the largest static delay
time of the bit.
The algorithm proceeds on a column-by-column basis, starting at the lowest bit position,
where the earliest arrival bits are added using a CSA or HA; the resulting sum bit goes into
Figure 40. HA_IP designs: (a) POS inputs, NEG outputs; (b) NEG inputs, POS outputs.
the priority queue of the current column, and the carry bit is placed in the queue of the next column (see Figure 41). We can modify this basic scheme to construct inverse polarity based multipliers. To use inverted polarity elements, we first require that all the inputs to a gate be of the same polarity, either POS or NEG. In array multipliers, this can be achieved fairly easily, since each logic level of adders can be of opposite polarity. In Wallace trees, however, some adders’ inputs come from signals of different logic levels (see Fig. 42). To create equal-polarity inputs, we must also track bit polarity, and in some cases inverters must be inserted to complement input bits.
Figure 41. Fadavi-Ardekani algorithm: (a) trapezoidal PP array — all bits in column n to be added, plus carry-in bits from column n-1; (b) bits are put in a priority queue (one queue for each column), and FAs are applied to the earliest arriving bits, yielding (c) two bit vectors to be added by the final adder.
5.3.3 Inverse Polarity Wallace Tree Algorithm
Our goal is to minimize the number of inverters which need to be added to equalize input bit polarity. This can be achieved by noting that when there is a large number of bits, we will always be able to find a usable number of bits of the same polarity. In
particular, we wish to use 3-input CSA blocks to perform PP reduction. (We can also use 2-
input HA blocks, but using such blocks when CSAs could be used is wasteful.) If we are
using 3-input carry-save adders, if there are more than 4 bits present whose polarity is either
POS or NEG, we are guaranteed to be able to find 3 bits of the same polarity. Therefore,
inverters do not need to be used until we have reduced a column to 4 bits or less. At this
point, inverters may be inserted, if needed, to equalize bit polarities prior to addition.
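The guarantee above is a simple pigeonhole argument, easy to confirm exhaustively (an illustrative sketch of our own, not from the dissertation):

```python
# Pigeonhole check: whenever a column holds at least 5 bits split between
# the POS and NEG queues, some polarity can supply all 3 CSA inputs.
def can_pick_csa(n_pos, n_neg):
    return n_pos >= 3 or n_neg >= 3

for total in range(5, 12):
    for n_pos in range(total + 1):
        assert can_pick_csa(n_pos, total - n_pos)

# With only 4 bits, a 2/2 split defeats the guarantee, which is exactly
# where the STOP procedure may need inverters.
assert not can_pick_csa(2, 2)
print("every column with more than 4 bits admits an inverter-free CSA")
```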
Figure 42. Inverse polarity: (a) arrays have a regular alternating structure; (b) Wallace trees have connections which “skip” a level, causing polarity conflicts at subsequent adder inputs.
The inverse polarity assembly starts in the partial product generation circuit—here the AND gates can be replaced by NAND gates. The only effect this has is to change all of the PP bits to a NEG polarity. We are doing two’s complement multiplication using the ’sign generate’ method to precompute the ’1’ bits used for sign extension [26].
The detailed algorithm appears in Fig. 43. Given the inputs to the Wallace tree from the partial product generation stage, we wish to ’tile’ or ’cover’ the bits to be added by carry-save adders (i.e., CSAs and HAs), as described by Dadda [24]. To assemble the inverse polarity multiplier (the TILE procedure), we provide two priority queues for each column, one for POS and one for NEG bits. Bits of the same polarity are selected from the queue with the lowest-delay bit. This procedure iterates until the total number of bits in both queues is less than 5.
In the STOP procedure, we attempt to reduce the number of bits down to 2. If there are originally 4 bits remaining, we use a CSA; otherwise (there are 3 bits remaining), we use an HA. When there are 3 or 4 bits, we cannot guarantee that the inputs are of the same polarity. In this case, we use inverters to normalize all bits to the same polarity.
Since this greedy assembly algorithm uses the lowest-delay bits each time it instantiates an adder, the procedure heuristically minimizes the growth of the maximum delay per column. The stopping condition is the only point where extra inverters are inserted, for the sole pur-
128 Analysis and Design of Low Power Multipliers
Design Issues for Polarity Inversion
Procedure TILE(column i) {
    /* bits arrive in queues POSi and NEGi, ordered by delay */
    while ( #bits{POSi} + #bits{NEGi} > 4 ) {
        /* pick the queue to work on */
        if (earliest{POSi} < earliest{NEGi})
            choose_Qi = POSi;
        else
            choose_Qi = NEGi;
        instantiate an IP adder on choose_Qi;
        add the earliest bits of choose_Qi;
        put the output bits on the opposite-polarity queues of
            columns i and i+1;   /* IP adder outputs flip polarity */
    }
    STOP( #bits{POSi} + #bits{NEGi} );
}

Procedure STOP(total) {
    switch (total):
    case 3:
        if (#bits{POS} >= 2) use HA_IP on POS bits;
        if (#bits{NEG} >= 2) use HA_IP on NEG bits;
    case 4:
        if (#bits{POS} >= 3) use CSA_IP on POS bits;
        if (#bits{NEG} >= 3) use CSA_IP on NEG bits;
        if (#bits{POS} == 2 and #bits{NEG} == 2) {
            find the latest bit (ltbit = POS or NEG);
            use CSA_IP on bits of type ltbit
                (use inverters to equalize inputs);
        }
}
Figure 43. Basic inverted polarity CSA tiling algorithm.
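As a concrete illustration, the TILE/STOP procedure can be sketched in Python. This is our own toy model, not the dissertation's implementation: bit arrival times stand in for static delays, every adder costs one delay unit, and the STOP phase is simplified to normalize its few remaining bits with "free" inverters:

```python
import heapq

CSA_DELAY = 1.0   # one adder stage per reduction level (arbitrary unit)

def tile_column(queues, i):
    """queues[i] = {'POS': heap, 'NEG': heap} of bit arrival times."""
    pos, neg = queues[i]['POS'], queues[i]['NEG']
    # TILE: while more than 4 bits remain, a CSA_IP can always find
    # 3 same-polarity inputs, so no inverters are needed yet.
    while len(pos) + len(neg) > 4:
        q, pol = (pos, 'POS') if (pos and (not neg or pos[0] < neg[0])) else (neg, 'NEG')
        if len(q) < 3:                                # earliest queue too small:
            q, pol = (neg, 'NEG') if pol == 'POS' else (pos, 'POS')
        ins = [heapq.heappop(q) for _ in range(3)]    # 3 earliest bits
        t_out = max(ins) + CSA_DELAY
        opp = 'NEG' if pol == 'POS' else 'POS'        # IP outputs flip polarity
        heapq.heappush(queues[i][opp], t_out)         # sum stays in column i
        heapq.heappush(queues[i + 1][opp], t_out)     # carry moves to column i+1
    # STOP (simplified): normalize the last <= 4 bits to one polarity and
    # reduce down to the final two bits for the final adder.
    rest = sorted(pos + neg)
    while len(rest) > 2:
        ins, rest = (rest[:3], rest[3:]) if len(rest) >= 4 else (rest[:2], rest[2:])
        t_out = max(ins) + CSA_DELAY
        rest.append(t_out)                            # sum bit stays here
        heapq.heappush(queues[i + 1]['POS'], t_out)   # carry to next column
        rest.sort()
    queues[i]['POS'], queues[i]['NEG'] = rest, []

# Toy example: one column holding 9 NEG partial-product bits at t = 0.
queues = [{'POS': [], 'NEG': [0.0] * 9}, {'POS': [], 'NEG': []}]
tile_column(queues, 0)
print(len(queues[0]['POS']))   # -> 2 (the column's final bit pair)
```

A real implementation would also track each bit's polarity into the final adder and account for inverter delay at the stopping condition; the sketch only shows the greedy queue discipline.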
pose of equalizing the input polarities of an adder. Once the partial product array has been completely reduced, two bits are present at each bit position. Note that the polarities of these final bits may well be mixed, i.e., at any given column, the final two bits may be both POS, both NEG, or one POS and one NEG. At this point, a final adder is created to generate the final result. This adder may also need an inverter at some of its inputs, to equalize the polarity of these inputs.
Note that even though extra inverters are ’inserted’, the total inverter count will always be less than in the conventional multiplier. In inverse polarity construction algorithms, all inverters are initially removed from the design, then a few are selectively ’put back’. The net effect is to reduce transistor count.
5.3.4 Physical Design Issues
In Wallace trees, interconnect capacitance is a decisive factor in determining the delay and power characteristics of a multiplier. In comparing conventional versus inverse polarity designs, the output loading of each adder cell is also important in determining these characteristics. In particular, delay values for inverse polarity designs are highly dependent on output load. For example, if the output load is negligible, inverse polarity implementations may be faster, due to reduced logic depth. On the other hand, if the output load is significant, inverse polarity circuits will have higher delay, because with their output inverters omitted, they have relatively weak drive strength.
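This load dependence can be seen with a back-of-envelope RC model. All constants below are hypothetical (not from the dissertation); they only illustrate that the inverse polarity cell wins at light loads and loses at heavy ones:

```python
# Hypothetical first-order RC delay model of the buffer-removal tradeoff.
R_LOGIC = 8.0    # kOhm, assumed drive resistance of the bare logic stage
R_BUF = 2.0      # kOhm, assumed drive resistance of the output buffer
C_BUF_IN = 3.0   # fF, assumed input capacitance of the buffer

def delay_conventional(c_load):
    # logic stage charges the buffer input, then the buffer drives the load
    return R_LOGIC * C_BUF_IN + R_BUF * c_load

def delay_inverse_polarity(c_load):
    # no buffer: the weak logic stage drives the load directly
    return R_LOGIC * c_load

for c_load in (1.0, 5.0, 20.0):   # fF
    faster = delay_inverse_polarity(c_load) < delay_conventional(c_load)
    print(c_load, "IP faster" if faster else "conventional faster")
```

Under these numbers the crossover sits at C_load = R_LOGIC * C_BUF_IN / (R_LOGIC - R_BUF) = 4 fF; in real multipliers the crossover depends on the cell sizing and wire loads measured later in this chapter.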
Our placement methodology initially followed the one developed for our array versus Wallace tree experiments of Chapter 3, where we used a simple, grid-based simulated
annealing placement tool to create layouts. (Recall that we used a general placement tool to
determine the layout because Wallace trees do not easily lend themselves to a regular layout
structure.) Results for placing inverse polarity structures were disappointing in that they had
very high delay. We eventually discovered that the majority of the delay was located in the
final adder.
In theory, the final adder blocks should be placed fairly close together, resulting in very short interconnect, and should therefore not be highly influenced by the low drive of the inverse
polarity circuits. In practice, since our wirelength minimizing simulated annealing algorithm
does not distinguish among functional blocks in determining placement, the goal of close
placement for components of the final adder was lost.
To improve placement characteristics, we developed a simple two-phase layout methodology: 1) final adder blocks are arranged using a procedural placement technique; 2) simulated annealing is used to group elements in the Wallace tree which have high connectivity.
The procedural placement technique attempts to arrange the final adder blocks next to the
outputs. It first calculates the total area necessary for the final adder’s components. Then, based on the available width of the placement area, it computes how many standard cell rows are necessary for the block. Finally, the adder blocks are ’trowelled’ in until the entire final adder has been placed.
Estimates of the footprint required for each basic logic block were calculated for the MOSIS HP 0.5 µm CMOS technology. The estimates were based on the layout area necessary for an inverter, and the area for general logic blocks was calculated as 1/2 their transistor
Figure 44. Procedural assembly of the final adder: (a) calculate the width of the final adder; (b) based on the width of the placement area, determine the number of rows needed; and (c) trowel in the final adder.
count times this base area.
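The area estimate and row computation of Fig. 44 can be sketched as follows. The constants here are invented for illustration; the dissertation derives its base area from a MOSIS HP 0.5 µm inverter layout, which we do not reproduce:

```python
import math

INV_AREA_UM2 = 30.0    # assumed inverter footprint (hypothetical value)
ROW_HEIGHT_UM = 6.0    # assumed standard-cell row height (hypothetical)

def block_area(transistors):
    # area model from the text: (transistor count / 2) * inverter area
    return transistors / 2.0 * INV_AREA_UM2

def rows_needed(blocks, placement_width_um):
    """blocks: transistor counts of the final adder's cells."""
    total = sum(block_area(t) for t in blocks)
    row_capacity = placement_width_um * ROW_HEIGHT_UM   # area of one cell row
    return math.ceil(total / row_capacity)

# e.g. sixteen 28T adders plus sixteen 2T inverters in a 100 um wide area
blocks = [28] * 16 + [2] * 16
print(rows_needed(blocks, 100.0))   # -> 12
```

The 'trowelling' step then simply fills these rows with adder blocks in order until the final adder is fully placed.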
Interconnect net length was calculated using a Steiner tree construction for multipoint
nets with Manhattan distances for each segment (0.165fF/µm cap. to ground.) Delays are
calculated using static timing analysis based on an HSPICE[51] characterization of cells.
Power is calculated using Star-sim[52] from Avant! Corp., which allows fast power compu-
tation with high accuracy. As in Chapter 3, we repeatedly simulated sets of 20 vectors, so as to arrive at a 95% confidence interval with a 5% error [10]. The power numbers which we cite later in the chapter are the average values of all simulations for a given multiplier. We performed some sample runs to test the resolution of Star-sim, and found its accuracy to be within +/-2% of HSPICE runs.
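For the 2-point nets that the text notes dominate these multipliers, the Steiner tree degenerates to the Manhattan distance between the pins, so the capacitance estimate reduces to a length times the quoted 0.165 fF/µm. A small sketch of our own (using the half-perimeter bound, which is exact for 2-point nets, in place of a full Steiner construction):

```python
CAP_PER_UM_FF = 0.165   # Metal1 capacitance to ground quoted in the text

def manhattan_length_um(pins):
    """pins: list of (x, y) positions in um; half-perimeter wirelength."""
    xs = [p[0] for p in pins]
    ys = [p[1] for p in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def net_capacitance_ff(pins):
    # ground capacitance proportional to estimated wirelength
    return manhattan_length_um(pins) * CAP_PER_UM_FF

# a 2-point net whose pins are 40 um apart in x and 20 um in y
print(round(net_capacitance_ff([(0.0, 0.0), (40.0, 20.0)]), 3))   # -> 9.9
```

For the few multipoint nets a true rectilinear Steiner tree can be shorter than the half-perimeter bound suggests, which is one reason the extracted capacitances reported later differ from these estimates.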
5.4 Experimental Results using Layout Estimation
We designed a number of Wallace tree multipliers, one set with conventional logic and the other with inverse polarity circuits. The count of inverters present in two such sets of multipliers is shown in Table 6. These include the inverters in the partial product generation, Wallace tree, and final adder.
Table 7 and Table 8 compare energy per operation between conventional and inverse
polarity multipliers. In all cases, minimum size devices were used to minimize power con-
sumption. Results clearly indicate a power advantage for inverse polarity multipliers. A more pronounced advantage is seen in larger multipliers with carry-select adders; these have the greatest number of adder circuits, so the reduced transistor count is most beneficial in these cases.
Table 6: Count of inverters in multipliers
Multiplier Size | Conventional | Inverse Polarity
8               | 216          | 47
16              | 824          | 141
Table 7: Energy/operation for 8-bit multipliers
Final Adder Style | Conventional | Inverse Polarity | Power Reduction
Ripple            | 3.78e-11 J   | 3.32e-11 J       | 12.2%
Carry Select      | 4.93e-11 J   | 4.63e-11 J       | 6.1%
Table 8: Energy/operation for 16-bit multipliers
Final Adder Style | Conventional | Inverse Polarity | Power Reduction
Ripple            | 4.27e-9 J    | 3.49e-9 J        | 18.3%
Carry Select      | 5.45e-9 J    | 4.13e-9 J        | 24.2%
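The reduction percentages follow directly from the energy columns; a quick arithmetic check (our own post-processing of the table values):

```python
# power reduction = 1 - E_inverse_polarity / E_conventional, as a percentage
def reduction_pct(e_conv, e_ip):
    return round((1.0 - e_ip / e_conv) * 100.0, 1)

# (conventional, inverse polarity) energy per operation, in joules
table = {
    ('8-bit', 'Ripple'):        (3.78e-11, 3.32e-11),   # Table 7: 12.2%
    ('8-bit', 'Carry Select'):  (4.93e-11, 4.63e-11),   # Table 7:  6.1%
    ('16-bit', 'Ripple'):       (4.27e-9,  3.49e-9),    # Table 8: 18.3%
    ('16-bit', 'Carry Select'): (5.45e-9,  4.13e-9),    # Table 8: 24.2%
}
for key, (conv, ip) in table.items():
    print(key, reduction_pct(conv, ip))
```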
A potential source of parasitic power dissipation in inverse polarity circuits is increased
short circuit (totem pole) current due to more slowly falling/rising inputs, which result from
the inverse polarity optimization. Detailed simulations found this effect to be insignificant.
Short circuit current flows through a conducting path between the supply and ground rails. The magnitude of this current is proportional to the conductance of this path, which is a function of the transistor W/L ratios in each stack. Given that all transistors are the same size, the conductance of the inverter should be higher, since there is only one transistor in its stack, compared to 2-3 transistors per stack in the full adder block (not counting output inverters). For this reason, the short circuit power of inverters should be higher.
We simulated the switching behavior of an inverter and a full adder. The average current
per nanosecond during discharge is as shown in the graphs below:
Figure 45. Average short circuit currents: (a) full adders have lower short circuit currents than (b) inverters. (The plots distinguish short circuit current from capacitance charging current.)
This shows that an inverter has slightly higher short circuit current than the full adders. We
expect that if the inverters were upsized (to drive higher loads), their short circuit current
would be even greater.
Delay, area and interconnect characteristics are shown in Table 9. These results cover two different architectures, conventional logic and inverse polarity logic, as well as two types of final adder, ripple and carry select. Inverse polarity circuits are slightly slower than conventional circuits, by 3%-6%. Delay penalties can be eliminated by judicious use of carry-save
adders: conventional adders for the critical path and inverse polarity adders for non-critical
paths. The delay penalty is entirely due to interconnect capacitance of the Wallace tree; when
we simulated the design without interconnect, we found that the inverse polarity versions
were slightly faster. It is our belief that the Wallace tree layouts can be improved (through the
use of better placement tools.) Further refinement of the Wallace tree layout should thus
improve the performance of inverse polarity designs. Table 9 also shows area and interconnect
Table 9: Delay, area, wire cap. for 16-bit multipliers
          | WT-Ripple | IP WT-Ripple | WT-Carry Sel. | IP WT-Carry Sel.
Delay     | 38.9 ns   | 40.0 ns      | 35.0 ns       | 36.9 ns
Area      | 8586 µm2  | 7368 µm2     | 9804 µm2      | 8576 µm2
Wire Cap. | 11,872 fF | 11,234 fF    | 13,290 fF     | 12,059 fF
capacitance numbers for 16-bit multipliers. As expected, the area of implementation is smaller. Furthermore, total interconnect capacitance was less for inverse polarity designs, which is due to the smaller area of implementation: closer components require less wiring.
5.5 Experiments with Detailed Layout
The previous analysis was performed using physical design characteristics which were
estimated based on data taken from the SPICE models. We estimated the area needed for the
layout of logic elements by looking at existing standard cell libraries and extrapolating sizes
for the Hewlett Packard 0.5µm process. Wiring capacitance was determined by calculating
the capacitance of a Metal1 line to ground, which was multiplied by the length of the wire.
Based on these values, we created a layout with estimated interconnect wires, which gave a
measure of capacitive load.
5.5.1 Enhanced Methodology
To conclude this analysis, we would like to analyze our multipliers using realistic layouts. There are several important effects which might be under-represented by our earlier estimation methodology. First, the area of each cell was computed as a function of the number of transistors—although there is a correlation between transistor count and cell size, there will be a certain amount of error involved. Furthermore, we are not aware of the mag-
nitude of cell parasitic capacitances. Secondly, the wire capacitance was estimated by global wiring, which was calculated as simply the Manhattan distance between components. This will be off because detailed wiring may be longer than the Manhattan distance if the interconnect is required to take a "snake"-like path, for example due to heavy wiring congestion between two connected components. Finally, if delay is significantly altered by interconnect, this will impact power dissipation, as explained in earlier chapters: greater skew between signals at the input of components will create more false switching.
To estimate the above effects, we developed a standard cell library in the modern STMicroelectronics .25µm CMOS process, using the Cadence tool flow for detailed geometry generation. The flow, shown in Fig. 46, is as follows:
• Cell library development: we developed a standard cell library, using designs from the existing STMicroelectronics HCMOS7 standard cell library where applicable. The minimum dimensions of transistors in the ST HCMOS7 library seem to be W = .75µm, L = .25µm, so we took these dimensions to be "minimum size". There were 7 basic cells which we used to form 22 standard cells—for example, the CSA is composed of 3 types of basic cells: a carry stage, a sum stage and two inverters. Metal1 is used for local interconnection, although for high congestion cells, polysilicon or Metal2 can be used for additional routing. Metal1 is not restricted to follow a grid, but it does follow a grid where
possible (so that later routing tools can use metal1, where it is left over.) The layers used
for routing are Metal2 (vertical), Metal3 (horizontal) & Metal4 (vertical).
• Create placement: as described previously, our placement is based on a simple simulated annealing algorithm, which attempts to minimize total wirelength while optimizing (eliminating) overlap of logic blocks. In some cases, the final placements had some overlap remaining, which was fixed by hand. The placement area was based on total cell
Figure 46. Physical design verification methodology: create standard cell library → generate LEF; create design → generate DEF; Silicon Ensemble (routing) → import to Cadence → DIVA (extraction) → netlisting of capacitances → compare capacitances and rerun static timing.
size, plus a margin of about 10%. Different aspect ratios were examined to see what effect these had on the wiring length. Our resulting placement was encoded into the Cadence DEF format, which was passed to Cadence Silicon Ensemble for routing, along with a LEF description of the cell library.
• Wire with Silicon Ensemble: the routing stage was uncomplicated—using the default settings of Silicon Ensemble, we performed global and detailed routing. The routing took 1 minute for a 4-bit multiplier and 5 minutes for an 8-bit multiplier. The resulting routing was then exported to the Cadence Design Framework.
• Extract, Back Annotate, Simulate: the layout was then verified and extracted using the ST .25µm rules with the DIVA extraction program. We encountered problems with the extraction stage, as DIVA has a tendency to rename nets in the extraction process. For 4-bit multipliers, about 5 nets were renamed; for 8-bit multipliers, 30+ nets were renamed. As we were unable to determine the reason for the net renaming, this was the primary limitation on the size of multipliers which can be investigated using this methodology; all our experiments were performed on 8-bit multipliers.
Note that in the versions of the DIVA rules available to us, adjacent wiring coupling is not extracted. Therefore, the capacitance numbers are the parallel-plate capacitance of a Metal1 line to ground, plus the fringe capacitance.
5.5.2 Placement Details
Our placement stage consists of defining a placement area, along with I/O pin locations. The size of the area is determined by the total area of the components to be placed, plus some additional area, typically 5-10% extra. We experimented with different aspect ratios to determine the effect of shape on overall capacitance. Two sample multipliers are shown in Fig. 47: the layout on the top is a 4-bit conventional multiplier, and the layout on the bottom is an 8-bit inverse polarity multiplier. Note that the 4-bit multiplier has an aspect ratio of 1.375, while the 8-bit multiplier has an aspect ratio of 1.19.
Figure 47. Placement for (a) a 4-bit conventional multiplier and (b) an 8-bit inverse polarity multiplier. The final adder is shown in dotted lines.
In these layouts, the inputs are on the bottom. The lowest order output bit is on the top
left of the placement, and the highest order output bit is on the top right. The final adder
stage comprises the top two rows of the placement area. However, in some cases, the adder does not completely fill the top rows, and components from the Wallace tree are placed in the remaining empty slots (e.g., in the 4-bit multiplier of Fig. 47a). Our placement algorithm took about 10 minutes for 4-bit multipliers and up to 15 hours for 16-bit multipliers. We were nearly always able to place multipliers without overlap, although in some cases we had to resolve the overlap by hand. Clearly, a less restrictive placement area and a more sophisticated placement algorithm would result in a better layout with no overlap.
For nearly all types of multipliers, the conventional versions were harder to place than the inverse polarity multipliers; this can be attributed to the greater total area of components to be placed, which results in a greater possibility of overlap.
5.5.3 Interconnect Details
Our extraction experiments were run on conventional and inverse polarity 4-bit and 8-bit
multipliers, using various placement areas. We initially investigated large aspect ratios (lay-
out is wider than it is tall), for the reasons given below. The wiring for 8-bit multipliers can
be seen in Fig. 48.
Figure 48. Routing of 8-bit multipliers: (top) conventional 8-bit multiplier and (bottom) inverse polarity 8-bit multiplier.
Experiments with Detailed Layout
In these layouts, interconnect was routed on metals 3 and 4. Metal 1 was used for intra-cell routing, and metal 2 was used for power and ground distribution. (It may be possible to reduce congestion by allowing metal 2 to be used for routing. Currently, metal 2 is partially used in the routing, but it is under-utilized, as the power lines are routed in the "wrong way" direction.) Routing was performed using Silicon Ensemble from Cadence, initially performing a global routing phase, and then performing local routing. All routing was completed without overflows, and took about 5 minutes (both global and local routing). Extraction was performed using the Cadence DIVA tool, within the methodology of the ST 0.25 µm design flow.
The resulting interconnect capacitances are shown in Table 10. We can see that inverse polarity circuits have less overall interconnect, which is a direct consequence of their smaller circuit count. These capacitances were also back-annotated into our timing tool to determine the longest path through static timing analysis. This data confirms that our capacitance estimates are fairly accurate.

Table 10: Simulated and extracted data

                          4-bit conv.   4-bit inv. pol.   8-bit conv.   8-bit inv. pol.
Capacitance estimate      229.185 fF    222.75 fF         2072.9 fF     1637.79 fF
Static delay              3317 ps       3603 ps           7231 ps       7704 ps
Extracted capacitance     228.152 fF    201.398 fF        2248 fF       1702 fF
Delay (ext. cap. based)   3336 ps       3498 ps           7373 ps       7693 ps
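The "Capacitance estimate" row above comes from a length-proportional wiring model. A minimal sketch of that estimation step follows; the per-unit capacitance value and the pin coordinates are illustrative placeholders, not values from the ST design kit.

```python
# Sketch of the routing-estimation step: global route length is
# approximated by the Manhattan distance between connected components,
# and net capacitance is assumed proportional to that length.
C_PER_UM = 0.2  # fF per µm of wire; placeholder, not an extracted value

def manhattan(p, q):
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def net_capacitance(pins):
    """Estimate a net's capacitance from a chain of pin-to-pin
    Manhattan segments (a simple stand-in for a real route)."""
    length = sum(manhattan(pins[i], pins[i + 1]) for i in range(len(pins) - 1))
    return length * C_PER_UM

# a 2-pin net spanning 30 µm horizontally and 10 µm vertically: 40 µm of wire
cap = net_capacitance([(0, 0), (30, 10)])  # -> 8.0 fF at 0.2 fF/µm
```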
A histogram of nets for the 8-bit inverse polarity multiplier is shown in Fig. 49, comparing the distribution of estimated net capacitance with extracted net capacitance. We see that the calculated values are very similar, with a slight tendency for the estimation to undercount the capacitance. Several factors caused the extracted capacitance to be at variance with our estimates.
Figure 49. Estimated versus extracted net capacitances for an 8-bit inverse polarity multiplier
• Underestimation: during our routing estimation stage, we determined the length of our global routes to be the Manhattan distance between components. This assumption can be wrong simply because it does not take into account congestion, which can cause wires to take a path which is longer than the basic Manhattan distance.
• Overestimation: our capacitance calculation step can also overestimate capacitance. Initially, when assuming capacitance was proportional to the length of a wire, we measured the capacitance of a metal 1 line to ground. In our methodology, much of the wiring was performed on metals 3 and 4, which have a much lower capacitance to ground. Therefore, we tend to overcalculate the capacitance of a given line.

For these experiments, we found that our capacitance estimates were fairly close to the actual extracted values of routed examples. In practice, we have seen that although a large amount of wiring is present on the multiplier, we do not see very long lines running adjacent to each other.

Of concern is the added capacitance of adjacent wires, which starts to become significant in deep submicron processes. We believe that although this effect is present in our designs, its impact will be small, for several reasons. In our designs, with few exceptions, wiring capacitance is a small amount of the overall node capacitance (node capacitance = interconnect capacitance + gate capacitance). This is because the basic building blocks that
we are using, CSAs, have inputs which each drive 6-7 transistors. Therefore, even for minimum-sized gates, the capacitance of each input is on the order of 20 fF (0.25 µm process). Long wires have capacitance on the order of 17-20 fF. Note furthermore that very few long wires are encountered, the majority being small to medium sized wires. Therefore, wiring effects do not dominate the total switched capacitance.
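Using the representative numbers above (roughly 20 fF of gate load per CSA input, 17-20 fF for the few long wires, and much less for typical ones), a quick check of the wiring fraction of node capacitance might look like the following sketch; the wire-length distribution is invented for illustration.

```python
# Rough check that wiring does not dominate switched capacitance.
# One CSA input load per net; the wire capacitance list is hypothetical:
# mostly short wires, a few medium ones, and a single long wire.
GATE_CAP_FF = 20.0  # approx. CSA input load for minimum-sized gates
wire_caps_ff = [2.0] * 8 + [5.0] * 3 + [18.0]

total_gate = GATE_CAP_FF * len(wire_caps_ff)
total_wire = sum(wire_caps_ff)
wire_fraction = total_wire / (total_gate + total_wire)
# wire_fraction is well under one half, so gate load dominates
```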
5.5.4 Aspect Ratio Details

We experimented with different aspect ratios, due to the possibility of wiring congestion in certain areas of the physical design, particularly after the partial product generation stage. Ideally, the layout of a multiplier might look like Fig. 50. This layout has been designed to optimize several characteristics. First, note that all signal flow is "top to bottom"—components which are at deeper logic depths are placed closer to the output pins. This can reduce congestion by ensuring that no signals flow "bottom to top". Next, note that all the partial product generators are clustered at the top of the layout. Since the inputs must fan out to many partial product generators, clustering these at the top helps reduce interconnect length.
A problem occurs if we consider the number of connections which must cross the boundary between the PPA generators and the Wallace tree. In general, this number will be O(n²) for an n-bit multiplier—for large multipliers, we can expect high congestion at this interface1.
There are several ways to alleviate such congestion. First, we can avoid segregating the PPA generators and the Wallace tree adders, and instead put some PPA generators in the Wallace tree.
1. At this point, we see one of the advantages of Booth recoding, which can cut the number of PPA bits in half. For this given layout scheme, the number of wires crossing this interface is also cut in half.
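The O(n²) crossing count, and the halving from Booth recoding noted in the footnote, can be illustrated with a simple count. Treating every partial-product bit as one crossing wire is an idealization:

```python
import math

def crossing_wires(n, booth=False):
    """Number of partial-product bits crossing from the PP generators
    into the Wallace tree for an n-bit multiplier, modeled as
    rows x n bits. Radix-4 Booth recoding roughly halves the rows."""
    rows = math.ceil(n / 2) if booth else n
    return rows * n

plain = crossing_wires(16)          # 16 rows x 16 bits = 256 wires
boothed = crossing_wires(16, True)  # 8 rows x 16 bits = 128 wires
```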
Figure 50. Physical design of Wallace tree multiplier: (a) logical structure, (b) idealized physical layout.
This approach eliminates the "top-to-bottom" signal flow, and may lead to excessive wire length. Second, with multiple-metal-layer technologies, we are not restricted to one plane pair—these signals can be distributed across multiple levels. In the extreme case, we can go away from the HVHV routing paradigm, using some of the horizontal layers to route signals in the vertical direction. Finally, we can increase the aspect ratio of the layout, using footprints which are wider than they are tall. For these layouts, more wiring capacity will be available in the vertical direction.

Our initial designs had aspect ratios slightly greater than '1', resulting in designs that were somewhat wider than tall. However, in practice we do not expect wiring congestion to be that great of a problem in DSP-sized multipliers (that is, small 8-16 bit multipliers), certainly not on the scale present in microprocessor floating point multipliers. Therefore, it is important to see if such aspect ratios, although desirable for congestion reasons, may have adverse effects on wiring capacitance. We then attempted to determine the impact of using wider aspect ratios.

We laid out two versions of our 8-bit multiplier using narrower and taller layouts, and determined wiring capacitance. The wiring of these designs is shown in Fig. 51. Although the aspect ratio is close to '1', there were no routing problems, and all nets were completed successfully.
Figure 51. Routing of 8-bit multipliers, aspect ratio ~1: (a) conventional 8-bit multiplier and (b) inverse polarity 8-bit multiplier.
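The benefit of a squarer floorplan has a simple geometric explanation: for two pins placed uniformly at random in a W x H region, the expected Manhattan separation is (W + H)/3, which for a fixed area is minimized when W = H. A quick sketch, using the 8-bit inverse polarity layout dimensions reported below in Table 11 (the uniform-pin model is an idealization):

```python
def avg_manhattan(width, height):
    """Expected Manhattan distance between two uniformly random points
    in a width x height rectangle: E|x1 - x2| = width / 3, and likewise
    for the y coordinates, so the total is (width + height) / 3."""
    return (width + height) / 3.0

# 8-bit inverse polarity layouts (µm)
wide = avg_manhattan(160, 100)    # aspect ratio 1.60
square = avg_manhattan(120, 118)  # aspect ratio ~1.02
# the squarer floorplan yields shorter expected wires, hence less capacitance
```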
The impact of aspect ratio on wiring capacitance is illustrated in Table 11. For these designs, the height was fixed and the width was allowed to vary. The height measurement consists of 9 µm per standard cell row, plus a 10 µm total offset for the pins at the top and bottom of the layout. We aimed for an aspect ratio of '1' for the inverse polarity multiplier—the conventional version was slightly wider. As can be seen from the results, a more square aspect ratio results in shorter wires and less capacitance. This conclusion is similar to that arrived at when placing general logic blocks.

Table 11: Aspect ratio and capacitance

                       8-bit conv.    8-bit inv. pol.   8-bit conv.    8-bit inv. pol.
                       version 1      version 1         version 2      version 2
Layout area            180 x 100 µm   160 x 100 µm      140 x 118 µm   120 x 118 µm
Aspect ratio           1.80           1.60              1.19           1.02
Capacitance estimate   2072.9 fF      1702.8 fF         1754.28 fF     1521.14 fF

Use of inverse polarity circuits provides certain advantages in multiplier construction, but they must be applied judiciously, as their use impacts certain characteristics, most notably delay. We will discuss some of the reasons why delay is an issue in inverse polarity circuits, then talk briefly about noise optimization issues.

5.6 Additional Design Considerations

In creating and simulating inverse polarity designs, we were able to discover several characteristics about the inverse polarity technique, the way it was assembled, and its physical design characteristics. In addition to providing power reduction, the inverse polarity design technique is a low-logic-depth, low-drive technology. There are interesting implications for circuit delay, as well as noise properties (signals coupling onto wires from adjacent nets).
5.6.1 Wiring Delay

The delay of a circuit depends on logic depth, drive strength, and load capacitance. In using inverse polarity circuits, logic depth is decreased, drive strength is decreased, and load capacitance (seen at the output of adders) remains fairly constant, although there is a slight tendency for it to decrease, as smaller die area leads to shorter interconnect, which generally leads to lower interconnect capacitance. Although the reduction in capacitance is generally true, capacitance is also a function of the presence of adjacent wires, especially in deep submicron designs. Therefore, it is uncertain whether the reduction in die area will lead to significant capacitance reduction.

In attempting to optimize delay, we used the same approach for both conventional and inverse polarity designs, so as not to favor one over the other. Although the above-mentioned results show a slight delay penalty in using inverse polarity circuits, again, we believe that further refinement of the design flow will yield reductions in delay—since the inverse polarity circuits have a greater portion of their delay due to interconnect, we believe that interconnect reduction techniques are more likely to yield results in inverse polarity circuits than in conventional circuits.

Another interesting characteristic of inverse polarity designs is their smaller logic depth.
If gate upsizing is applied to logic circuits, delay is reduced up to a certain point, beyond
which gate upsizing fails to reduce the delay. At this point, the delay of a logic function is
closely related to its logic depth; therefore, under maximum gate upsizing, inverse polarity
designs should be faster. It remains unclear whether gate upsizing ever reaches this regime or
whether this is simply an ’extreme-case’ effect.
5.6.2 'Logic-based' Delay

Our PP reduction tree assembly algorithm was based on the work of Ardekani [33], who presented a simple, efficient heuristic for assembling a Wallace tree while minimizing delay. The algorithm proceeds on a column-by-column basis, from bit position 0 to the highest bit position. In each column, CSA blocks are used to reduce all the bits down to two, which will be the inputs to the final adder. (The operation was described in Fig. 41.)

Our algorithm is very similar, with the exception that we use two priority queues to store the bit arrival times, so as to keep track of POS and NEG input bits separately. Ardekani also does not take into account input-dependent delay information (different inputs to the same function have different delays). There is a problem in using this kind of assembly algorithm for inverse polarity. Consider the case of Fig. 41b. Here we see that a CSA has been applied to reduce three bits in a column; the sum bit is placed back in the priority queue for this column, while the carry bit is placed in a priority queue for the next column.
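A minimal sketch of this column-by-column reduction follows. For brevity it uses a single queue per column and a uniform unit CSA delay; our actual tool keeps separate POS and NEG queues and uses input-dependent delays.

```python
import heapq

CSA_DELAY = 1.0  # illustrative unit delay per carry-save adder

def reduce_columns(columns):
    """Ardekani-style reduction: in each column, repeatedly combine the
    three earliest-arriving bits with a CSA; the sum bit goes back into
    the same column's queue and the carry bit into the next column's
    queue, until every column holds at most two bits."""
    cols = [list(c) for c in columns]
    for c in cols:
        heapq.heapify(c)
    i = 0
    while i < len(cols):
        col = cols[i]
        while len(col) > 2:
            if i + 1 == len(cols):
                cols.append([])  # carries may extend past the last column
            a, b, c3 = (heapq.heappop(col) for _ in range(3))
            t = max(a, b, c3) + CSA_DELAY   # output ready after slowest input
            heapq.heappush(col, t)          # sum stays in column i
            heapq.heappush(cols[i + 1], t)  # carry moves to column i + 1
        i += 1
    return cols

# four bits arriving at t=0 in one column reduce to two after one CSA,
# with one carry pushed into the next column
cols = reduce_columns([[0.0, 0.0, 0.0, 0.0]])
```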
When the algorithm is applied to the next column, this carry bit will be connected to the input of a CSA. This input will add extra load to the output of the carry logic, which will affect the delay of the sum bit in the previous column (Fig. 52c). In the conventional case, an inverter is present on the output of the carry, and this prevents loading of the previous column (Fig. 52b).

Figure 52. Inverse polarity loading effects: (a) use of full adder in PP reduction; (b) circuit implementation of (a) using conventional full adders; (c) implementation using inverse polarity full adder—note that the load on the carry output affects the delay of the sum output.

This type of loading can have a significant effect on the overall delay of the multiplier. If several of these types of loading are present in a critical path, the effect can be severe—the effect is less important if these types of loading are distributed among several non-interacting paths. This is especially problematic if the carry output of an inverse polarity circuit is connected to a long interconnect stage. The effect of such loading causes delays in the previous column to grow. This means that the optimal mapping for minimum delay, which was the goal of our algorithm, will be thrown off, and the resulting delays will be greater than expected.

To counter this effect, a more complex PP reduction assembly scheme may be desirable. For example, the assembly algorithm could make several passes, re-mapping certain connections which have excessive load. In some cases, delay can be improved at the expense of power—an inverter can be placed on those carry outputs which are particularly problematic. Finally, in the physical design stage, one should try to minimize the loading of such carry outputs. For example, in the placement stage, one could cluster some components together to reduce interconnect, and then place these clusters instead of placing each component separately.

5.6.3 Noise Issues

An interesting and potentially useful characteristic of inverted polarity circuits has to do with longer rise- and fall-times (switching times). Since buffer removal implies that outputs will have shallower slopes (for same-size transistors), the outputs will be weaker at coupling
noise onto adjacent lines. A well-known maxim for high speed design [3] is: given two logic
families with identical maximum propagation delay statistics (50% input-50% output switching points), the family with the slowest output switching time will be cheaper and easier to use. This is because signals with slow rise- and fall-times contain fewer high frequency components, and therefore couple less noise onto other lines. The inverted polarity structure removes buffers and thereby lowers the drive strengths of CSAs and half adders. Therefore, the reduction of high-drive signals may be a secondary advantage of this technique.
However, one should be aware that nodes which are driven with less strength are them-
selves more susceptible to noise injection. This is because the interconnect line which is
being driven is connected to one of the rails (either Vdd or Vgnd) by a greater impedance.
Hence this node is more isolated from the power supplies and is more easily pulled to differ-
ent voltages by adjacently coupled nodes (i.e., bootstrapped.)
Given these two effects, it is unclear whether there is a net advantage or disadvantage as
far as noise immunity is concerned. Ultimately, this may depend on the circuitry surround-
ing a multiplier. If the multiplier has been a source of noise, the inverse polarity technique
can help reduce this. However, if the multiplier has been susceptible to noise from neigh-
bors, an inverse polarity multiplier will be more so.
5.6.4 Further Optimizations

The inverse polarity optimization has been shown to effectively reduce transistor count,
as well as area-of-implementation, resulting in lower capacitance and therefore less power
consumed. The reduction in drive strength, although partly compensated by a lowering of the
logic depth, results in an increase in delay. All these results hold for minimum sized transis-
tors.
There are several issues which should be addressed when implementing the inverse polarity technique. First, a reduction in the delay of multipliers can be obtained by upsizing devices. It remains unclear whether inverse polarity circuits will be beneficial in this case. On the one hand, an inverse polarity stage has low drive, so its delay would benefit immensely from device upsizing. On the other hand, a large number of transistors needs to be upsized (as opposed to the conventional buffered circuit, where only the buffer has to be upsized), so the power-efficiency of inverse polarity circuits may be quickly lost once upsizing takes place. Nevertheless, one can argue that in the extreme case, the inverse polarity multiplier with its smaller logic depth can achieve asymptotically lower delay than the conventional version.
An alternative approach would be to mix inverse polarity adders with conventional adders in multipliers. The logic circuit of a multiplier is like standard combinational logic in that it has a critical delay path, as well as a number of less-than-critical paths. Since conventional adders achieve lower delay, they can be used on the critical path, and inverse polarity adders can be
used solely for off-critical-path elements. Note that this would increase the delay of signal paths that were previously non-critical, so care must be taken to prevent previously non-critical paths from becoming critical. In these cases, the power savings depend on the number of circuit elements which can be replaced by inverse polarity circuits.
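One simple way to realize such a hybrid, sketched below under the assumption that per-adder slacks from static timing analysis are already available, is to substitute an inverse polarity adder only where the slack absorbs its delay penalty. A real pass would re-run timing after substitutions, since each replacement changes downstream slack; the function name and numbers here are illustrative.

```python
def choose_adder_types(slacks, penalty):
    """Pick 'inverse' for adders whose timing slack covers the inverse
    polarity delay penalty, 'conventional' otherwise. Slacks would come
    from static timing analysis; the penalty is the extra stage delay."""
    return ['inverse' if s >= penalty else 'conventional' for s in slacks]

# adders on near-critical paths (small slack) stay conventional;
# off-critical adders (large slack) become inverse polarity
types = choose_adder_types([0.0, 50.0, 300.0, 120.0], penalty=100.0)
```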
Finally, our analysis of multiplier layout used simulated annealing based designs which
employed procedural placement techniques for the final adder. Although annealing has the
advantageous property that it can be applied to general designs without requiring knowledge
of underlying structure, the loss of structural information may lead to sub-optimal results. In
our layouts, we found that on some occasions, components whose interconnection lay on the
critical path were placed far apart, whereas other components which were less critical were
placed close together. It seems that if more structural information could be provided to the
annealing algorithm, better placements could be achieved.
5.7 Summary
We presented a method of reducing transistor count in Wallace tree multipliers through
reducing the number of inverters needed to complete the operation. Results indicate that up
to 25% power reduction can be achieved for minimum-sized multipliers. There is a small
delay penalty associated with this technique, due to reduced drive strengths. Further gains
were achieved in overall interconnect capacitance and die area.
In contrast with latch insertion, we achieved significant power reductions. At a basic level,
this is understandable, since we removed circuitry, while latch insertion adds circuitry. In
order to amortize the additional power dissipation which results from adding circuitry, the
amount of false switching that must be eliminated is quite large. In Wallace trees, we have
seen that the switching activity is much less than in arrays. Therefore, it is not surprising that
the inverse polarity optimization achieved power reduction while latch insertion experiments
failed to produce similar power savings.
6 Conclusions
Power dissipation in multiplier designs has been much-researched in recent years, due to
the importance of the multiplier circuit in a wide variety of microelectronic systems. The
focus of multiplier design has traditionally been delay optimization, although this design
goal has recently been supplemented by power consumption considerations. Our goal has
been first to understand how power is dissipated in multipliers, and secondly to devise ways
to reduce this power consumption.
In this thesis, we described previous work which has been done in the area of multiplier
delay and power optimization. We identified methods by which multiplier delay has been
reduced, and we concentrated on understanding how these various speedup techniques
impact the power dissipation of the multiplier as a whole.
In Chapter 3, we investigated the application of arrays and Wallace trees to partial product
reduction. Conflicting views on the power dissipation characteristics of these two techniques
led us to more closely analyze switching behavior and interconnect loading characteristics.
We devised a simulation environment and a layout model which allowed efficient investiga-
tion of various multiplier configurations, each having distinct delay and power consumption
characteristics. We concluded that while Wallace trees offer a decided delay advantage over array schemes, they also offer a power advantage, as the false switching in array designs outweighs the power consumed by long interconnect in Wallace trees.
We decided to focus on Wallace trees for power reduction, since these have become the
architecture of choice in recent chip designs, due to their better delay properties. False switch-
ing in Wallace trees can be reduced through the use of latches, as described in Chapter 4.
Unfortunately, the power cost of the latches and the circuitry needed to generate the timing
signal tends to overwhelm the power savings generated by reduced switching in the Wallace
tree. Although applications of latches have proven effective in array schemes, we concluded
that Wallace trees do not benefit from latch insertion.
A more useful approach to reducing power in Wallace trees is reducing circuit count
through removal of redundant inverters. This can be achieved using the inverse polarity opti-
mization, which is described in Chapter 5. By keeping track of bit polarities as the Wallace
tree is constructed, we arrive at designs which have reduced circuit count. Power savings of up to 25% were achieved, along with reductions in die area and interconnect. Delay penalties are present, but may be alleviated through a hybrid inverse polarity/conventional construction.
In conclusion, we have presented an investigation of multiplier power dissipation, along
with some techniques which allow reductions in power consumption for this circuit. Given
the importance of multipliers, it is likely that further research efforts will be directed at opti-
mizing this block for delay and power efficiency.
7 Bibliography
[1] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design, Addison-Wesley, 1988, p. 312.
[2] C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1980, p. 166.
[3] H.W. Johnson and M. Graham, High Speed Digital Design, Prentice-Hall Inc., New Jersey, 1993, p. 60.
[4] T. Cormen, C. Leiserson, R. Rivest, Introduction to Algorithms, McGraw-Hill Publishers, Massachusetts, 1990.
[5] R. Hitchcock, Sr., "Timing Verification and the Timing Analysis Program," IEEE Design Automation Conference, 1982, pp. 594-604.
[6] A.P. Chandrakasan, S. Sheng, and R.W. Brodersen, "Low Power Digital Design," IEEE Journal of Solid State Circuits, April 1992, Vol. 27, pp. 473-484.
[7] A.P. Chandrakasan, M. Potkonjak, R. Mehra, J. Rabaey, R.W. Brodersen, "Optimizing Power Using Transformations," IEEE Transactions on CAD, Jan. 1995, Vol. 14, No. 1, pp. 12-31.
[8] A. Ghosh, S. Devadas, K. Keutzer, J. White, "Estimation of Average Switching Activity in Combinational and Sequential Circuits," IEEE Design Automation Conference, 1992, pp. 253-259.
[9] F. Najm, "Transition Density: A New Measure of Activity in Digital Circuits," IEEE Transactions on CAD, 1993, Vol. 12, No. 2, pp. 310-323.
[10] R. Burch, F. Najm, P. Yang, and T. Trick, "McPOWER: A Monte Carlo Approach to Power Estimation," IEEE/ACM International Conference on Computer Aided Design, 1992, pp. 90-97.
[11] F. Najm, "A Survey of Power Estimation Techniques in VLSI Circuits," IEEE Transactions on VLSI, 1994, Vol. 2, No. 4, pp. 446-454.
[12] J.M. Rabaey and M. Pedram, Low Power Design Methodologies, Kluwer Academic Publishers, Norwell, Mass., 1996.
[13] G.K. Yeap, Practical Low Power Digital VLSI Design, Kluwer Academic Publishers, Norwell, Mass., 1998.
[14] T. Sakurai, H. Kawaguchi, and T. Kuroda, "Low-Power CMOS Design through Vth Control and Low-Swing Circuits," Proceedings of the 1997 ISLPED, pp. 1-6.
[15] C.H. Tan and J. Allen, "Minimization of Power in VLSI Circuits Using Transistor Sizing, Input Ordering, and Statistical Power Estimation," International Workshop on Low Power Design, 1994, pp. 75-80.
[16] M. Borah, R.M. Owens, and M.J. Irwin, "Transistor Sizing for Low Power CMOS Circuits," IEEE Transactions on CAD, June 1996, Vol. 15, No. 6, pp. 665-671.
[17] K. Usami et al., "Automated Low-power Technique Exploiting Multiple Supply Voltages Applied to a Media Processor," IEEE Custom Integrated Circuits Conference, 1997, pp. 579-586.
[18] R.K. Krishnamurthy and L.R. Carley, "Exploring the Design Space of Mixed Swing Quadrail for Low-Power Digital Circuits," IEEE Transactions on VLSI, Dec. 1997, Vol. 5, No. 4, pp. 388-400.
[19] P. Landman and J. Rabaey, "Architectural Power Analysis: The Dual Bit Type Method," IEEE Transactions on VLSI Systems, June 1995, pp. 173-187.
[20] L.S. Nielsen, J. Sparso, "Designing asynchronous circuits for low power: an IFIR filter bank for a digital hearing aid," Proceedings of the IEEE, Vol. 87, No. 2, pp. 268-281.
[21] S. Devadas, S. Malik, "A survey of optimization techniques targeting low power VLSI," 32nd Design Automation Conference Proceedings, 1995, pp. 242-247.
[22] T.L. Martin, D.P. Siewiorek, "A Power Metric for Mobile Systems," International Symposium on Low Power Electronics and Design, 1996, pp. 37-42.
[23] C.S. Wallace, "Suggestions for a Fast Multiplier," IEEE Trans. Electron. Computers, 1964, EC-13, pp. 14-17.
[24] L. Dadda, "Some Schemes for Parallel Multipliers," Alta Freq., 1965, 34:349-356.
[25] S.D. Pezaris, "A 40-ns 17-bit by 17-bit Array Multiplier," IEEE Transactions on Computers, Vol. C-20, No. 4, pp. 442-447.
[26] W.J. Cody (chairman), "A Proposed Standard for Binary Floating-Point Arithmetic," COMPUTER Magazine, 1981, special reprint.
[27] M. Annaratone and W.Z. Shen, "The Design of an LSI Booth Multiplier," Carnegie Mellon University, Thesis Report, No. CMU-CS-84-150.
[28] M. Santoro and M. Horowitz, "SPIM: A Pipelined 64x64-bit Iterative Multiplier," IEEE Journal of Solid-State Circuits, April 1989, Vol. 24, No. 2, pp. 487-493.
[29] M. Borah, R. Owens, M.J. Irwin, "High-throughput and Low-power DSP Using Clocked-CMOS Circuitry," International Symposium on Low Power Design, 1995, pp. 139-144.
[30] R.K. Montoye, E. Hokenek, S.L. Runyon, "Design of the IBM RISC System/6000 Floating-Point Execution Unit," IBM Journal of Research and Development, January 1990, Vol. 34, No. 1, pp. 59-77.
[31] D. Goldberg, "Computer Arithmetic," in Computer Architecture: A Quantitative Approach, J.L. Hennessy and D.A. Patterson, San Mateo, CA: Morgan Kaufmann Publishers, Inc., 1990, pp. A1-A66.
[32] L.E. Thon, P. Sutardja, F. Lai and G. Coleman, "A 240MHz 8-Tap Programmable FIR Filter for Disk-Drive Read Channels," IEEE International Solid State Circuits Conference, 1995, pp. 82-83.
[33] J. Fadavi-Ardekani, "M x N Booth Encoded Multiplier Generator Using Optimized Wallace Trees," IEEE Transactions on VLSI Systems, June 1993, Vol. 1, No. 2, pp. 120-125.
[34] D. Carlson et al., "A 677MHz RISC Microprocessor Containing a 6.0ns 64b Integer Multiplier," IEEE International Solid State Circuits Conference, 1998, pp. 294-295.
[35] B. Ackland et al., "A Single-Chip 1.6 Billion 16-bit MAC/s Multiprocessor DSP," IEEE Custom Integrated Circuits Conference, 1999, pp. 537-540.
[36] Y. Hagihara et al., "A 2.7ns 0.25µm CMOS 54x54b Multiplier," IEEE International Solid State Circuits Conference, 1998, pp. 296-297.
[37] T.K. Callaway and E.E. Swartzlander Jr., "Low Power Arithmetic Components," in Low Power Design Methodologies, J. Rabaey and M. Pedram, eds., Kluwer Academic Publishers, Norwell, Mass., 1996, pp. 161-198.
[38] A. Bellaouar and M.I. Elmasry, Low-Power Digital VLSI Design, Circuits and Systems, pp. 442-450, Norwell, Mass.: Kluwer Academic Publishers, 1995.
[39] B. Ackland, C.J. Nicol, "High Performance DSPs - What's Hot and What's Not?" International Symposium on Low Power Electronics and Design, 1998, pp. 1-6.
[40] P. Larsson, C.J. Nicol, "Transition Reduction in Carry-Save Adder Trees," International Symposium on Low Power Electronics and Design, 1996, pp. 76-79.
[41] C.J. Nicol, P. Larsson, "Low Power Multiplication for FIR Filtering," International Symposium on Low Power Electronics and Design, 1997, pp. 76-79.
[42] R. Fried, "Minimizing Energy Dissipation in High-Speed Multipliers," International Symposium on Low Power Electronics and Design, 1997, pp. 214-219.
[43] C. Lemmonds and S. Shetti, "A Low Power 16 by 16 Multiplier Using Transition Reduction Circuitry," International Workshop on Low Power Design, 1994, pp. 139-142.
[44] E. de Angel, "Low Power Digital Multiplication," in Application Specific Processors, E.E. Swartzlander, ed., Kluwer Academic Publishers, Norwell, Mass., 1997.
[45] E. Musoll and J. Cortadella, "Low-Power Array Multipliers with Transition-Retaining Barriers," Fifth International Workshop on Power and Timing Modeling, October 1995.
[46] P. Meier, R. Rutenbar, and L.R. Carley, "Exploring Multiplier Architecture and Layout for Low Power," IEEE Custom Integrated Circuits Conference, 1996, pp. 513-516.
[47] R. Rutenbar, "Simulated Annealing Algorithms: An Overview," IEEE Circuits and Devices Magazine, Jan. 1989, pp. 19-26.
[48] M. Sivaraman and A.J. Strojwas, "Towards Incorporating Device Parameter Variations in Timing Analysis," Proceedings of the European Design Conference, 1994, pp. 338-342.
[49] J. Fishburn, "Switch Level Tools," in Algorithms and Techniques for VLSI Layout Synthesis, D. Hill et al., Kluwer Academic Publishers, 1989, pp. 153-179.
[50] J. Fishburn, "A Depth-Decreasing Heuristic for Combinational Logic," 27th ACM/IEEE Design Automation Conference, 1990, pp. 361-364.
[51] HSPICE User's Guide, Meta-Software Inc. (now Avant! Corporation), Fremont, CA, 1992.
[52] Star-Sim User's Guide, Avant! Corporation, Fremont, CA, June 1997.