
CHAPTER 3

HIGH SPEED MULTIPLIER DESIGN

3.1 INTRODUCTION

Digital Signal Processors (DSPs) and application-specific integrated circuits rely on the efficient implementation of arithmetic circuits to execute dedicated algorithms such as convolution, correlation and filtering. In this chapter, two new techniques are proposed to implement multiplier circuits. The first technique uses decomposition logic and improves the overall power-delay product of the multiplier. The decomposition algorithm is tested on different multiplier circuits, namely the Carry-Save, Wallace and Dadda multipliers. A guideline for choosing the appropriate decomposition structure for larger multipliers is also provided. To pipeline the Dadda multiplier effectively, a new adder structure with latched outputs is proposed. The proposed latched adder reduces the overhead of implementing pipeline structures for a Dadda multiplier circuit.

The second proposed technique uses a bypassing algorithm. Circuits employing the bypassing algorithm dissipate less power, since the algorithm reduces their spurious switching activity. The bypassing algorithm is tested on Wallace and Dadda multipliers.

All the multiplier circuits are implemented and tested using logic

families such as CMOS, CPL and Hybrid XOR.


3.2 RELATED BACKGROUND

Various types of multipliers are discussed in the literature to

achieve power and performance optimization. Column compression

architecture for fast multiplication proposed by Wallace (1964) offers a total

delay which is proportional to the logarithm of the operand word length of the

multiplier. These column compression multipliers are faster than array

multipliers, because delay in an array multiplier varies linearly with the

operand word length. A unique placement strategy for reducing the stage

counters in column compression architecture was proposed by Dadda (1965).

To achieve a compact layout, Goto et al (1992) have proposed a regularly

structured tree multiplier with recurring blocks. Different methods for

compressing the bits in a Wallace tree to achieve improved column

compression are proposed by Oklobdzija and Villeger (1995). Itoh et al

(2001) have proposed a rectangular styled tree multiplier by folding to

achieve a compact layout at the expense of more complicated interconnects.

Wallace multipliers have slightly more area and approximately the same

worst case delay as that of Dadda multipliers in deep submicron technologies

(Andrea Bickerstaff et al, 2001). Wallace tree multipliers based on adiabatic 4-2 compressors proposed by Xien Ye et al (2005) achieve considerable energy savings, but with increased latency.

To reduce spurious switching in array multiplier designs, several techniques have been proposed. Parhami (2000) introduced Carry-Save adders to minimize the spurious switching. Huang and Ercegovac (2005) have proposed a method for balancing the signals to the adder at each intermediate stage, by analyzing the signal delay at each stage and thereby deciding the best connection for the successive stage. Chong et al (2005) have proposed the idea of introducing latches in adders. Row bypassing techniques for array multiplier energy reduction proposed by Obhan et al (2002) eliminate the addition of zero partial products, at the expense of additional shifters and detection circuitry to align the partial results for accumulation. The column bypassing technique proposed by Wen et al (2005) has a circuit overhead of one multiplexer per adder cell. Parallel multipliers based on low power adders were proposed by Rizwan Mudassir and Abid (2005); they offer significant improvement in speed and power dissipation compared to standard array multiplier architectures. Senthilpari et al (2007) have analyzed the performance of 8x8 Carry-Save multipliers using non-clocked pass gate families. A high speed reconfigurable multiplier was proposed by Wei Li et al (2007) to achieve high performance in block cipher algorithms. Hwang et al (2007) have proposed enhanced row bypassing schemes for array multipliers by implementing the multiplexing mechanism using clocked CMOS circuitry, in order to resolve the DC power dissipation problem due to voltage loss in gated signals. A low energy Booth leap frog array multiplier using dynamic adders is proposed by Chong et al (2007) for applications where energy and IC area are critical. Tzu-Yuan Kuo and Jinn-Shyan Wang (2008) have proposed a new low voltage latch adder based Wallace tree multiplier. The multipliers proposed in the existing literature do not offer sufficient parallelism to minimize the glitch power.

3.3 CONVENTIONAL MULTIPLIERS

Wallace tree, Dadda and Carry-Save multipliers are commonly available architectures for performing multiplication. The Carry-Save multiplier is derived from an array multiplier. Tree-structured multipliers like Wallace and Dadda fare well despite their irregularity and excess wiring. This is because tree multipliers offer a smaller depth of partial product reduction hardware, which in turn appears to offset the power loss in wiring. This helps to reduce the overall power dissipation and delay.

3.3.1 Carry-Save Multiplier

Figure 3.1 shows a 4x4 multiplier implemented using the Carry-Save method. Unlike the normal array multiplier, in a Carry-Save multiplier the output carry bits are propagated diagonally downwards instead of to the right. It is so named because the carry bits of each stage are saved and passed to the next adder row rather than being propagated sideways immediately. This design requires an extra adder, called the vector-merging adder, to produce the final result. This multiplier uses more transistors, and hence occupies more area.

Figure 3.1 4x4 Carry-Save Multiplier

3.3.2 Wallace Multiplier

Figure 3.2 shows an 8x8 multiplier proposed by Wallace (1964). The Wallace method employs a three-step process to multiply two numbers:

Step 1 : Generate all partial products in parallel using an AND gate array.

Step 2 : If there are $p$ rows of partial products, group $3\lfloor p/3 \rfloor$ rows into sets of three in the present stage and pass the remaining $p \bmod 3$ rows to the next stage unchanged. Repeat this step until the matrix height is reduced to two rows.

Step 3 : Sum the two resulting rows with a fast carry propagate adder to produce the final product.

Figure 3.2 Wallace Multiplier for 8x8 Multiplication (inputs X0 to X7 and Y0 to Y7, product bits Z0 to Z15)

Wallace (1964) showed that the delay of an NxN multiplier can be reduced to the order of $\log_2 N$, making it faster than an array multiplier. Minimizing the delay of the bit-product reduction process is very important in reducing the overall multiplication time. In Wallace tree multipliers, rows are grouped into sets of three during each reduction stage. This is done to accommodate a pseudo-adder (a row of full adders with no carry chain), which adds three operand bits to produce a two-bit intermediate result with only a single full adder delay. Within each three-row set, (3,2) counters reduce columns with three bits to two bits and (2,2) counters reduce columns with two bits to two bits. Rows that are not part of a three-row set are transferred to the next stage without modification. The height of the matrix in the $j^{th}$ reduction stage, $w_j$, is defined by the recursive equations (3.1) and (3.2):

$w_0 = N$    (3.1)

$w_{j+1} = 2\lfloor w_j/3 \rfloor + (w_j \bmod 3)$    (3.2)
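The recursion in equations (3.1) and (3.2) can be evaluated directly to see how many reduction stages a given operand width needs. The following is a minimal sketch; the function name `wallace_heights` is an illustrative assumption and does not come from the thesis.

```python
def wallace_heights(n):
    """Return the matrix heights w_0, w_1, ... for an n x n Wallace multiplier.

    Implements w_0 = n and w_{j+1} = 2 * floor(w_j / 3) + (w_j mod 3),
    stopping once the matrix has been reduced to two rows.
    """
    heights = [n]
    while heights[-1] > 2:
        w = heights[-1]
        # Each full group of three rows becomes two rows; leftover rows pass through.
        heights.append(2 * (w // 3) + w % 3)
    return heights


if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        h = wallace_heights(n)
        print(f"N={n}: heights {h}, reduction stages {len(h) - 1}")
```

For an 8x8 multiplier this gives the height sequence 8, 6, 4, 3, 2, that is, four reduction stages before the final carry propagate addition.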

3.3.3 Dadda Multiplier

Dadda (1965) generalized and extended Wallace’s results by noting

that a full adder can be thought of as a circuit, which counts the number of

ones in the input and outputs that number in 2-bit binary form. Using such a

counter, Dadda postulated that, at each stage, only minimum amount of

reduction should be done in order to reduce the partial product matrix by a

factor of 1.5. Dadda’s method requires the same number of levels as that of

Wallace method. However Dadda’s method does the minimum reduction

necessary at each level. This results in a design with fewer full adders and

half adders. The number of (3,2) and (2,2) counters required is minimized in

Dadda’s technique compared to Wallace tree. The disadvantage of Dadda’s

method is that it requires a slightly wider fast Carry Propagate Adder (CPA)

and has a less regular structure than the Wallace multiplier. Figure 3.3 shows an 8x8 Dadda multiplier.

Figure 3.3 Dadda Multiplier for 8x8 Multiplication

The reduction process for a Dadda multiplier is developed using the following recursive algorithm:

Step 1 : Let $d_1 = 2$ and $d_{j+1} = \lfloor 1.5\, d_j \rfloor$, where $d_j$ is the matrix height for the $j^{th}$ stage from the end. Find the largest $j$ such that at least one column of the original partial product matrix has more than $d_j$ bits.

Step 2 : In the $j^{th}$ stage from the end, employ (3,2) and (2,2) counters to obtain a reduced matrix with no more than $d_j$ bits in any column.

Step 3 : Let $j = j - 1$ and repeat step 2 until a matrix with only two rows is generated.

For an $N \times N$ bit Dadda multiplier, there are $N^2$ bits in the original partial product matrix and $4N - 3$ bits in the final two-row matrix. The total number of (3,2) counters required is $N^2 - 4N + 3$. The length of the carry propagate adder (CPA) is $2N - 2$. The total number of (2,2) counters required is $N - 1$.
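The three steps and the closed-form counts above can be cross-checked with a small column-height simulation. This is only a sketch under stated assumptions: it tracks the number of bits per column rather than actual signals, it places counters greedily from the least significant column upwards, and the names `dadda_targets` and `dadda_reduce` are hypothetical.

```python
def dadda_targets(n):
    """Stage height limits d_j (d_1 = 2, d_{j+1} = floor(1.5 * d_j)) below n, largest first."""
    targets = [2]
    while (3 * targets[-1]) // 2 < n:
        targets.append((3 * targets[-1]) // 2)
    return targets[::-1]


def dadda_reduce(n):
    """Count the counters used to reduce an n x n partial product matrix.

    Returns (full_adders, half_adders, final_column_heights).
    """
    # Column k of the AND array holds min(k + 1, 2n - 1 - k) bits.
    heights = [min(k + 1, 2 * n - 1 - k) for k in range(2 * n - 1)]
    full_adders = half_adders = 0
    for target in dadda_targets(n):
        carries_in = 0  # carries produced by the previous column in this stage
        for col in range(len(heights)):
            h = heights[col] + carries_in
            carries_out = 0
            while h > target:
                if h == target + 1:
                    half_adders += 1  # (2,2) counter: column loses one bit
                    h -= 1
                else:
                    full_adders += 1  # (3,2) counter: column loses two bits
                    h -= 2
                carries_out += 1  # every counter sends one carry to the next column
            heights[col] = h
            carries_in = carries_out
    return full_adders, half_adders, heights


if __name__ == "__main__":
    for n in (4, 8):
        fa, ha, final = dadda_reduce(n)
        cpa_length = sum(1 for h in final if h == 2)
        print(f"N={n}: (3,2)={fa}, (2,2)={ha}, final bits={sum(final)}, CPA length={cpa_length}")
```

For N = 8 the simulation uses 35 (3,2) counters and 7 (2,2) counters, and leaves 29 bits in the two-row matrix with a 14-bit CPA, in agreement with the expressions $N^2 - 4N + 3$, $N - 1$, $4N - 3$ and $2N - 2$.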

3.4 LOGIC FAMILIES CONSIDERED FOR DESIGNING MULTIPLIERS

Three Logic families namely Complementary Metal Oxide

Semiconductor (CMOS), Complementary Pass-transistor Logic (CPL) and

Hybrid XOR have been chosen for implementing the various multipliers.

3.4.1 CMOS Logic

CMOS logic supports an efficient implementation of the design

with high noise margin and low static power consumption. Output logic level

does not depend on the transistor size and it is a ratioless logic. The rise times

and fall times of the inputs are controlled, tending to be ramps rather than step

functions. The disadvantage lies in the large PMOS transistors which result in

high input and internal capacitances. Also, area requirements are large, i.e., if

there are N inputs, 2N transistors (N each for PUN and PDN) are needed.

Moreover, a weak output driving capability is caused by series transistors.

The basic CMOS full adder implementation is shown in Figure 3.4.

Figure 3.4 Full Adder using CMOS Logic (Static Mirror Adder)

3.4.2 Complementary Pass-Transistor Logic (CPL)

Figure 3.5 shows a full adder implemented using CPL. CPL benefits from small input capacitances (NMOS network only), a fast differential stage, a reduced number of transistors (N transistors for N inputs) and good output driving capability, making the implementation of complex gates very efficient. Usually, level restorer circuits (PMOS transistors) are necessary for swing restoration. The switching speed is very high because of the low threshold voltage. However, this makes these zero-threshold devices difficult to switch off. Also, the large number of nodes and transistors and the two inversion levels result in inefficiency. CPL has larger short-circuit currents, higher wiring overhead and increased power consumption compared to CMOS. The CPL adder shown in Figure 3.5 requires seven inverters to generate the complement signals. However, when this adder is used in designs such as multipliers, the complementary input signals can be derived from the outputs of the previous stage. This reduces the transistor count. The drivability of the adder is also fairly good even without inverters at the output, owing to the PMOS pull-up transistors. Therefore, in complex designs such as multipliers, the output inverters for generating sum and carry can be used in alternate stages of the design, thereby improving speed and reducing area.

Figure 3.5 Full Adder using CPL

In the proposed work, the inverters and half adders used in the CPL

implementation were also designed using CMOS logic to yield better output

results. CPL causes reduced voltage swing. Hence buffers have been inserted

to restore the logic levels wherever necessary. This has led to slightly higher

power dissipation.

3.4.3 Hybrid XOR Logic

Hybrid XOR, proposed by Chang et al (2005), combines various logic styles to achieve an optimized structure. It comprises CPL, transmission gate (TG) logic and CMOS logic. The adder implemented in this logic style suits tree-structured arithmetic units. The schematic is shown in Figure 3.6.

Figure 3.6 Full Adder using Hybrid XOR Logic

It is designed by optimizing three modules. The first module

generates the XOR and XNOR logic signals of the adder inputs. The sum and

carry are then generated using these signals. The structure is balanced as the

sum and carry outputs are generated simultaneously.

3.5 PROPOSED DECOMPOSITION ALGORITHM

In this thesis, a new technique to implement digital multipliers using decomposition logic is presented. Here, the multiplication process is decomposed into smaller sub-units (smaller multipliers) and their outputs are combined to get the final result. By doing so, parallel processing is introduced in addition to the benefits of a structured implementation of the multiplier. In the first stage, the $N \times N$ partial products are split into four $N/2 \times N/2$ multiplier blocks. The outputs from these blocks are then combined in a tree-like fashion to get the final result. The $N/2 \times N/2$ multiplier is implemented using a Carry-Save structure for decomposed Carry-Save multipliers, a Wallace structure for decomposed Wallace multipliers, and a Dadda structure for decomposed Dadda multipliers. The decomposition logic requires extra circuitry to perform the final addition of the outputs obtained from the $N/2 \times N/2$ multipliers. However, because the $N/2 \times N/2$ multipliers operate in parallel, a significant improvement in speed is achieved. Since the inputs to the final adder circuitry arrive in parallel, glitches are reduced, resulting in less power dissipation. The decomposition process is further extended to $N/4 \times N/4$ multiplier blocks. If the decomposition process is extended progressively, however, the benefits derived from parallel processing are outweighed by the degradation due to the extra logic circuitry. This occurs because a large number of intermediate result bits are generated in each column. These intermediate result bits need to be combined in a tree-like fashion by grouping them into sets of three and sets of two. Hence several levels of reduction are required before the number of rows is reduced to two.
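The arithmetic behind the first decomposition stage can be illustrated at the word level. The sketch below is an assumption-laden illustration rather than the circuit itself: it splits each unsigned N-bit operand into high and low halves, forms the four half-width products that the four sub-multiplier blocks would compute in parallel, and recombines them with shifts and additions. The function name `decomposed_multiply` is hypothetical.

```python
def decomposed_multiply(a, b, n):
    """Multiply two unsigned n-bit integers using four (n/2 x n/2) products.

    With a = a_hi * 2^(n/2) + a_lo and b = b_hi * 2^(n/2) + b_lo:
        a * b = a_hi*b_hi * 2^n + (a_hi*b_lo + a_lo*b_hi) * 2^(n/2) + a_lo*b_lo
    """
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half

    # Four independent sub-products; in hardware these blocks run in parallel.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Final combination stage: align each sub-product and add.
    return lo_lo + ((lo_hi + hi_lo) << half) + (hi_hi << n)


if __name__ == "__main__":
    print(decomposed_multiply(0xB7, 0x5C, 8) == 0xB7 * 0x5C)
```

Each half-width product is at most N bits wide, which matches the statement that every 4x4 block of a decomposed 8x8 multiplier generates 8 bits of output.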

Figure 3.7 shows the dot diagram of a decomposed 8x8 multiplier structure. An 8x8 multiplier has eight rows of partial products, each row having eight terms. In the first stage, the partial products are grouped into four 4x4 multiplier blocks as shown in Figure 3.7. Each 4x4 multiplier block generates 8 bits of output.

Figure 3.7 Decomposition Structure for 8x8 Multiplication

Hence in the second stage the partial product matrix of the first stage is reduced to four rows, each row containing 8 bits. In the second stage, columns containing a single bit are transferred to the third stage unchanged, and columns containing three bits are compressed to two bits, namely sum and carry, using a full adder and then transferred to the third stage. The bits transferred unchanged and the sum bits of the second-stage full adders form the first row of the third stage. The carry bits of the full adders are shifted to the next column and form the second row of the third stage. The second-row dots are connected to the first-row dots by a diagonal line, indicating that they are the carry bits generated by the full adders in the second stage. In the third stage the two rows are compressed using a fast adder to generate the final result.

Figure 3.8 shows the dot diagram of a decomposed 16x16 multiplier structure using 8x8 multipliers. In the first stage, four 8x8 multiplier blocks are used to combine all the partial products. The outputs from these 8x8 multipliers are then combined in a tree-like fashion, in a similar manner, to produce the final result.

Figure 3.8 Decomposition Structure for 16x16 Multiplication using 8x8 Multipliers

Figure 3.9 represents the dot diagram of the decomposed 16x16 multiplication process using 4x4 multipliers. It is observed that more addition stages are required to generate the final result. This results in excess hardware, which may account for increased delay and power consumption when compared to a 16x16 multiplication performed using 8x8 decomposition.

Figure 3.9 Decomposition Structure for 16x16 Multiplication using 4x4 Multipliers

3.6 PROPOSED DESIGN OF AN 8X8 PIPELINED DADDA MULTIPLIER

Pipelined circuits can be constructed by using level-sensitive latches at the outputs of intermediate stages. Pipelining is a popular design technique often used to accelerate the operation of data paths in DSPs. Two pipelined multiplier structures are presented and their performances are compared. One pipelined multiplier is implemented with the proposed latched CPL adder, and its performance is compared with that of a pipelined multiplier constructed using the static latch proposed by Uming Ko and Poras Balsara (2000). This latch is named the ‘PowerPC latch’ and is shown in Figure 3.10. It uses a transmission gate controlled by a clock signal at the input. The feedback path consists of an inverter and a transmission gate combined together to reduce power dissipation.

To reduce the overheads (transistor count and power dissipation) of

implementing pipelined multiplier design, a latched adder is proposed by

modifying the CPL adder. The latched CPL adder is shown in Figure 3.11.

The latch portion of the adder is derived from a two phase CPL flip-flop

structure. The structure is pseudo-static and requires only single phase

clocking as opposed to the two phase clocking required for the PowerPC

latch. The latched version of the CPL adder requires only two extra transistors

when compared to the CPL adder. When the PowerPC latch is used at the

output of the adder, 10 transistors are needed. Hence, 8 transistors are saved

by using the Latched CPL adder as compared to PowerPC latch.

Figure 3.10 PowerPC Latch

Figure 3.11 Latched CPL Adder

The pipelining technique is applied to an 8x8 Dadda multiplier implemented using decomposition logic. The pipelined structure is shown in Figure 3.12. Two structures are designed, one using the latched CPL adder and the other using the PowerPC latch. The latched CPL adder is used in the final addition stage of the 4x4 Dadda multiplier, while the PowerPC latch is used at the outputs of the 4x4 Dadda multiplier. All the other adders used in the pipelined multiplier are CPL adders without a latch.

Figure 3.12 Pipeline Structure

3.7 PROPOSED BYPASSING ALGORITHM FOR WALLACE AND DADDA MULTIPLIERS

In the proposed Wallace and Dadda multiplier architectures, the first step involves the generation of partial products using an AND array. In the second step the partial products are grouped into sets according to the conventional Wallace and Dadda algorithms respectively. Modification of the architecture is proposed in the implementation of (3,2) and (2,2) counters which are key members of partial product reduction hardware. Figure 3.13 shows the circuit diagram of the tri-state buffer.

Figure 3.13 Tri-state Buffer

The tri-state buffer is constructed using transmission gate logic. The control signal for the tri-state buffer is ‘cs’. The architecture of the proposed (3,2) counter is shown in Figure 3.14. The proposed (3,2) counter has a full adder, an OR gate, two tri-state buffers and two (2x1) multiplexers. It has three inputs named ‘a’, ‘b’ and ‘ci’ and two outputs named ‘s’ and ‘c’. The input operands ‘a’ and ‘b’ are passed through the tri-state buffers to drive the input terminals of the full adder inside the (3,2) counter. The control signal ‘cs’ for the two tri-state buffers and the two (2x1) multiplexers is derived by performing a logic OR operation on the inputs ‘a’ and ‘b’. The control signal is high when at least one of the operand bits is high, and low when both operand bits are low. When the control signal ‘cs’ is high, the input operands ‘a’ and ‘b’ propagate through the buffers and stimulate the input terminals of the full adder. The full adder cell is activated and the bits ‘a’, ‘b’ and ‘ci’ are added. This yields the sum output ‘s1’ and the carry output ‘c1’ of the full adder.

Figure 3.14 Proposed (3,2) Counter

These outputs ‘s1’ and ‘c1’ then propagate through the (2x1)

multiplexers labeled ‘I’ and ‘II’ respectively to produce the sum output ‘s’

and carry output ‘c’ of the (3,2) counter. When the control signal ‘cs’ is low,

the tri-state buffers are open circuited which in turn causes the full adder cell

to get deactivated. Hence the third input operand ‘ci’ of the proposed (3,2)

counter is directly routed through the (2x1) multiplexer labeled as ‘I’ to

generate the sum output ‘s’ and logic ‘0’ is routed through the (2x1)

multiplexer labeled as ‘II’ to generate the carry output ‘c’ of the (3,2)

counter.

Figure 3.15 Proposed (2,2) Counter

The architecture of the proposed (2,2) counter is shown in Figure 3.15. The proposed (2,2) counter has one half adder, one tri-state buffer and two multiplexers. It has two input operands ‘a’ and ‘b’ and two outputs named ‘s’ and ‘c’. The input operand ‘a’ also acts as the control signal ‘cs’ for the tri-state buffer and the two (2x1) multiplexers. When the control signal ‘cs’ is high, the input operand ‘a’ propagates through the tri-state buffer and stimulates the input terminals of the half adder cell. The half adder is enabled and adds the two bits ‘a’ and ‘b’. This generates the sum output ‘s1’ and the carry output ‘c1’ of the half adder. These outputs ‘s1’ and ‘c1’ then propagate through the (2x1) multiplexers labeled ‘I’ and ‘II’ respectively to produce the sum output ‘s’ and carry output ‘c’ of the (2,2) counter. When the control signal ‘cs’ is low, the tri-state buffer is open circuited, which in turn deactivates the half adder cell. Hence the second input operand ‘b’ of the proposed (2,2) counter is routed directly through the (2x1) multiplexer labeled ‘I’ to generate the sum output ‘s’, and logic ‘0’ is routed through the (2x1) multiplexer labeled ‘II’ to generate the carry output ‘c’ of the (2,2) counter.
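The bypass behaviour of both counters can be checked against ordinary adders with a short behavioural model. This is a logic-level sketch only: it ignores the tri-state buffers' electrical behaviour and all timing, and the function names are illustrative assumptions rather than names used in the thesis.

```python
from itertools import product


def full_adder(a, b, ci):
    """Reference full adder: returns (sum, carry)."""
    return a ^ b ^ ci, (a & b) | (ci & (a ^ b))


def bypass_counter_32(a, b, ci):
    """Behavioural model of the proposed (3,2) counter: returns (s, c)."""
    cs = a | b  # control signal from the OR gate
    if cs:
        # Buffers conduct, the full adder is active, multiplexers select s1 and c1.
        return full_adder(a, b, ci)
    # Full adder bypassed: multiplexer I passes ci, multiplexer II passes logic 0.
    return ci, 0


def bypass_counter_22(a, b):
    """Behavioural model of the proposed (2,2) counter: returns (s, c)."""
    cs = a  # operand 'a' doubles as the control signal
    if cs:
        return a ^ b, a & b  # half adder outputs s1 and c1
    # Half adder bypassed: multiplexer I passes b, multiplexer II passes logic 0.
    return b, 0


if __name__ == "__main__":
    ok_32 = all(bypass_counter_32(a, b, ci) == full_adder(a, b, ci)
                for a, b, ci in product((0, 1), repeat=3))
    ok_22 = all(bypass_counter_22(a, b) == (a ^ b, a & b)
                for a, b in product((0, 1), repeat=2))
    print(ok_32, ok_22)
```

The exhaustive comparison confirms that bypassing never changes the logical result: whenever the adder cell is deactivated, the value it would have produced equals the bypassed input with a zero carry.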

The partial products are reduced progressively until only two rows remain. The final step adds these two rows using a fast carry propagate adder.

3.8 SIMULATION

Simulation for the multiplier designs was done using Tanner EDA

tool. The parameters considered for evaluating the proposed multiplier

structures are power, delay, power-delay product and transistor count.

3.8.1 Simulation Results for Decomposition Algorithm

The proposed decomposition algorithm is tested on Carry-Save,

Wallace and Dadda multipliers.


3.8.1.1 Results of carry-save multipliers

The Carry-Save multiplier circuits were simulated using TSMC

180 nm technology. The threshold voltages of NMOS and PMOS transistors

are kept as 0.39 V and -0.41 V respectively. The supply voltage is set to

1.8 V for all modules with rise and fall times of the input set to 0.10 ns.

Tables 3.1 and 3.2 list a comparative study on Carry-Save multipliers with

and without decomposition for various performance parameters.

Table 3.1 Results for 8x8 Carry-Save Multiplier with and without Decomposition

Performance Parameters | Without Decomposition: CMOS | Hybrid XOR | CPL | Using 4x4 Decomposition: CMOS | Hybrid XOR | CPL
Average power (mW) | 0.4504 | 0.62 | 1.024 | 0.4254 | 0.479 | 1.069
Delay (ns) | 1 | 1.8 | 1.571 | 0.857 | 1.67 | 1
Power-Delay Product (pJ) | 0.4504 | 1.116 | 1.6171 | 0.3643 | 0.7999 | 1.069
Transistor count | 2688 | 1888 | 2192 | 3072 | 2880 | 2659

Table 3.2 Results for 16x16 Carry-Save Multiplier with and without Decomposition

Performance Parameters | Without Decomposition: CMOS | Hybrid XOR | CPL | Using 8x8 Decomposition: CMOS | Hybrid XOR | CPL
Average power (mW) | 3.90 | 8.15 | 8.17 | 3.35 | 7.70 | 7.59
Delay (ns) | 5.192 | 3.327 | 4.091 | 3.076 | 2.702 | 2.692
Power-Delay Product (pJ) | 20.248 | 27.115 | 33.405 | 10.303 | 20.805 | 20.432
No. of Transistors | 13776 | 7892 | 11780 | 11520 | 12800 | 8992

From Tables 3.1 and 3.2 it is inferred that Carry-Save multipliers implemented using decomposition logic have lower delay and, except for the power of the 8x8 CPL design, lower power consumption than Carry-Save multipliers implemented without decomposition. This is because the decomposed multiplier blocks compute in parallel and the output signals from the partitioned blocks have the same arrival times. Since the signals for the next stage arrive at the same time, glitches are suppressed, which in turn reduces power dissipation. The transistor count of the 8x8 Carry-Save multiplier implemented using 4x4 decomposed blocks is higher than that of an 8x8 Carry-Save multiplier implemented without decomposition, which indicates an area overhead.
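The power-delay product entries follow from the power and delay rows, since 1 mW x 1 ns = 1 pJ. The sketch below recomputes the 8x8 Carry-Save figures from the power and delay values of Table 3.1 and derives the relative reduction; the dictionary layout and function names are illustrative assumptions, and only the numeric inputs come from the table.

```python
# (average power in mW, delay in ns) copied from Table 3.1
TABLE_3_1 = {
    "CMOS": {"without": (0.4504, 1.0), "with_4x4": (0.4254, 0.857)},
    "Hybrid XOR": {"without": (0.62, 1.8), "with_4x4": (0.479, 1.67)},
    "CPL": {"without": (1.024, 1.571), "with_4x4": (1.069, 1.0)},
}


def pdp_pj(power_mw, delay_ns):
    """Power-delay product in picojoules (mW * ns = pJ)."""
    return power_mw * delay_ns


def pdp_reduction_percent(family):
    """Relative PDP reduction obtained by 4x4 decomposition, in percent."""
    before = pdp_pj(*TABLE_3_1[family]["without"])
    after = pdp_pj(*TABLE_3_1[family]["with_4x4"])
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    for family in TABLE_3_1:
        before = pdp_pj(*TABLE_3_1[family]["without"])
        after = pdp_pj(*TABLE_3_1[family]["with_4x4"])
        print(f"{family}: {before:.4f} pJ -> {after:.4f} pJ "
              f"({pdp_reduction_percent(family):.1f}% lower)")
```

Recomputed products may differ from the tabulated ones in the last digits, presumably because the tabulated power and delay values are rounded.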


Figures 3.16 to 3.18 compare the delay, power and power-delay product of an 8x8 Carry-Save multiplier without decomposition and an 8x8 Carry-Save multiplier using 4x4 decomposition.

Figure 3.16 Delay Comparison of Carry-Save Multiplier with and without Decomposition

Figure 3.17 Power Comparison of Carry-Save Multiplier with and without Decomposition

Figure 3.18 Power-Delay Product Comparison of Carry-Save Multiplier with and without Decomposition

3.8.1.2 Results of Wallace tree multipliers

Wallace Tree multipliers are simulated using TSMC 180 nm technology. For 180 nm technology, the threshold voltages of NMOS and PMOS transistors are kept as 0.39 V and -0.41 V respectively. The supply voltage is set to 1.8 V for all modules with rise and fall times of the input set to 0.10 ns.

The Wallace multipliers with and without decomposition are also simulated and tested for supply voltage variations and technology variations. The length and width specifications for 180 nm technology are as follows: NMOS: L=180 nm and W=270 nm; PMOS: L=180 nm and W=810 nm. The length and width specifications for 130 nm technology are as follows: NMOS: L=130 nm and W=195 nm; PMOS: L=130 nm and W=585 nm. For TSMC 130 nm technology, the threshold voltages of NMOS and PMOS transistors are around 0.332 V and -0.3499 V respectively. The input patterns were switched at a frequency of 50 MHz. The rise and fall times of the inputs are set to 0.10 ns. Tables 3.3 and 3.4 show a comparative study of Wallace multipliers with and without decomposition for various performance parameters.

Table 3.3 Simulation Results of 8x8 Wallace Multiplier with and without Decomposition

Performance Parameters | Without Decomposition: CMOS | Hybrid XOR | CPL | Using 4x4 Decomposition: CMOS | Hybrid XOR | CPL
Average power (mW) | 0.911 | 0.765 | 0.342 | 0.431 | 0.560 | 0.321
Delay (ns) | 1.05 | 1.3 | 1.39 | 1.05 | 1.10 | 0.98
Power-Delay Product (pJ) | 0.956 | 0.994 | 0.475 | 0.452 | 0.616 | 0.314
Transistor count | 2492 | 3102 | 2560 | 2292 | 2902 | 2120

Table 3.4 Simulation Results of 16x16 Wallace Multiplier with and without Decomposition

Performance Parameters | Without Decomposition: CMOS | Hybrid XOR | CPL | Using 8x8 Decomposition: CMOS | Hybrid XOR | CPL | Using 4x4 Decomposition: CMOS | Hybrid XOR | CPL
Average power (mW) | 1.943 | 9.231 | 5.32 | 1.275 | 7.801 | 3.424 | 1.515 | 8.799 | 4.168
Delay (ns) | 3.3 | 2.1 | 2.4 | 3.1 | 1.8 | 1.64 | 3.6 | 1.9 | 1.897
Power-Delay Product (pJ) | 6.411 | 19.38 | 12.76 | 3.952 | 14.04 | 5.615 | 5.454 | 16.718 | 7.906
Transistor count | 11434 | 13214 | 9834 | 10164 | 12854 | 9664 | 11700 | 15126 | 12768

From Tables 3.3 and 3.4 it is observed that the power-delay product of the Wallace multipliers implemented using the decomposition process is lower than that of Wallace multipliers implemented without decomposition, for all logic families considered. This is due to the parallel processing in the computation of the partitioned blocks in the first stage. Further, it can be observed that 16x16 multipliers implemented using 8x8 decomposed blocks have a lower power-delay product than 16x16 multipliers implemented using 4x4 decomposed blocks. This is because the decomposed structure using 4x4 partitioned blocks yields sixteen rows of intermediate partial products after the first stage, compared to only four rows with 8x8 blocks. This increase in the number of intermediate rows requires more stages of computation to compress the rows with adders and generate the final result. Hence for 16x16 multipliers implemented using 4x4 blocks the hardware requirement is higher, which is indicated directly by the transistor count. The critical path delay and the power consumed also increase for 16x16 Wallace multipliers implemented using 4x4 partitioned blocks. As a generalization, it can be said that $N \times N$ multipliers implemented using $N/2 \times N/2$ multiplier blocks possess the least power-delay product for all logic families. Figures 3.19 and 3.20 show the power-delay product comparison of 8x8 and 16x16 Wallace tree multipliers with and without decomposition.

Figure 3.19 Power-Delay Product Comparison of 8x8 Wallace Tree Multiplier with and without Decomposition

Figure 3.20 Power-Delay Product Comparison of a 16x16 Wallace Tree Multiplier with and without Decomposition

Tables 3.5 to 3.8 list a comparative study on Wallace multipliers

with and without decomposition for various performance parameters based on

supply voltage variations and technology variations.

Table 3.5 Results of 8x8 Decomposed Multipliers for Supply Voltage Variations in 180 nm Technology

Simulation Results for an 8x8 Wallace Multiplier

Performance Parameter | Supply Voltage (V) | Proposed Decomposition Algorithm (Using 4x4 Wallace Multipliers): CMOS | CPL | Hybrid XOR | Without Decomposition: CMOS | CPL | Hybrid XOR
Delay (ns) | 1.8 | 0.84 | 1.16 | 0.85 | 1.22 | 1.6 | 1.12
Delay (ns) | 1.6 | 0.98 | 1.32 | 1.39 | 1.67 | 1.92 | 1.79
Power (mW) | 1.8 | 0.390 | 0.174 | 0.72 | 0.467 | 0.299 | 0.96
Power (mW) | 1.6 | 0.146 | 0.138 | 0.407 | 0.205 | 0.210 | 0.53
Power-Delay Product (fJ) | 1.8 | 327.6 | 201.84 | 612 | 569.74 | 478.4 | 1075.2
Power-Delay Product (fJ) | 1.6 | 143.08 | 182.16 | 565.73 | 342.35 | 403.2 | 948.7

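Simulation Results for an 8x8 Wallace Multiplier

Performance Parameter | Supply Voltage (V) | Proposed Decomposition Algorithm (Using 4x4 Wallace Multipliers): CMOS | CPL | Hybrid XOR | Without Decomposition: CMOS | CPL | Hybrid XOR
Delay (ns) | 1.3 | 0.67 | 1.74 | 1.09 | 0.93 | 2.78 | 1.52
Delay (ns) | 1.1 | 0.85 | 1.98 | 1.25 | 1.25 | 3.26 | 2.03
Power (µW) | 1.3 | 57.34 | 73.18 | 130.3 | 80.32 | 117.6 | 279.8
Power (µW) | 1.1 | 39.10 | 46.77 | 83.92 | 55.09 | 65.73 | 172.2
Power-Delay Product (fJ) | 1.3 | 38.41 | 127.33 | 142.02 | 74.69 | 326.92 | 425.29
Power-Delay Product (fJ) | 1.1 | 33.23 | 92.604 | 104.9 | 68.86 | 214.27 | 349.56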





Simulation Results for a 16x16 Wallace Multiplier Proposed Decomposition Algorithm


Using 8x8 Wallace Multipliers



XOR CMOS CPL Hybrid XOR

Delay (ns) 1.8 1.17 2.27 1.94 1.42 2.982 2.44 1.69 3.70 3.06 1.6 1.84 2.52 2.17 2.35 3.34 3.03 2.52 4.12 3.79

Power (mW) 1.8 1.25 1.645 6.149 1.52 1.634 6.79 1.72 1.597 7.01 1.6 0.739 1.132 2.020 0.869 1.22 2.451 0.942 1.146 2.772

Power-Delay Product

(pico Joules)

1.8 1.465 3.7342 11.929 2.154 4.872 16.567 2.906 5.908 21.45

1.6 1.3596 2.8526 4.383 2.042 4.074 7.4265 2.37 4.7215 10.50






Simulation Results for a 16x16 Wallace Multiplier Proposed Decomposition Algorithm

Without Decomposition Using 8x8 Wallace Multipliers



XOR CMOS CPL Hybrid XOR

Delay (ns) 1.3 1.08 1.51 2.21 1.24 1.78 2.77 1.52 2.45 3.12

1.1 1.52 1.93 2.42 1.73 2.47 2.98 2.06 3.12 3.98

Power (mW) 1.3 0.284 0.644 0.628 0.334 0.569 0.775 0.487 0.671 0.788

1.1 0.195 0.391 0.411 0.228 0.326 0.507 0.293 0.462 0.464

Power-Delay Product

(pico Joules)

1.3 306.72 972.4 1387.88 414.16 1012.8 2146.75 740.24 1643.9 2458.56

1.1 296.4 754.63 994.62 394.4 805.22 1510.86 603.5 1441.4 1846.72

From Tables 3.5 to 3.8, it is observed that the 8x8 Wallace tree multipliers implemented using 4x4 decomposition and the 16x16 Wallace tree multipliers implemented using 8x8 decomposition have the least power-delay product in all cases. The same trend is obtained for the decomposed Wallace multipliers across supply voltage variations and technology variations.

3.8.1.3 Results of Dadda multipliers

Dadda multipliers are simulated using TSMC 180 nm technology. The threshold voltages of NMOS and PMOS transistors are kept as 0.39 V and -0.41 V respectively. The supply voltage is set to 1.8 V for all modules with rise and fall times of the input set to 0.10 ns. To account for process variation, Dadda Multiplier circuits were further tested at different supply voltages ranging from 1.0 V to 1.8 V. The two pipelined structures were then compared for their power dissipation values and number of transistors used.


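Table 3.9 Results for 8x8 Dadda Multiplier with and without Decomposition

8x8 Dadda Multiplier Designed using CPL Adder: Performance Comparison of 8x8 Dadda Multiplier

| Supply voltage (V) | Power (µW), Using 4x4 Decomposition | Power (µW), Without Decomposition | Delay (ns), Using 4x4 Decomposition | Delay (ns), Without Decomposition | Critical Path Delay Improvement Savings % | Power-Delay Product (× 10^-15 Joules), Using 4x4 Decomposition | Power-Delay Product (× 10^-15 Joules), Without Decomposition | Power-Delay Product Savings % |
|---|---|---|---|---|---|---|---|---|
| 1.8 | 567 | 569 | 1.12 | 1.51 | 26.19 | 635.04 | 859.19 | 26.19 |
| 1.5 | 184 | 189 | 1.45 | 1.92 | 26.48 | 266.80 | 362.88 | 26.48 |
| 1.2 | 112 | 117 | 2.51 | 3.23 | 25.61 | 281.12 | 377.91 | 25.61 |
| 1.0 | 76.7 | 80.8 | 4.00 | 5.19 | 26.83 | 306.80 | 419.352 | 26.83 |

Transistor Count

| Using 4x4 Decomposition | Without Decomposition |
|---|---|
| 1648 | 1476 |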


Table 3.10 Results for a 16x16 Dadda Multiplier with and without Decomposition

16x16 Dadda Multiplier Designed using CPL Adder: Performance Comparison

| Supply voltage (V) | Power (mW), Without Decomposition | Power (mW), Using 8x8 Dadda | Power (mW), Using 8x8 Decomposed Structure | Delay (ns), Without Decomposition | Delay (ns), Using 8x8 Dadda | Delay (ns), Using 8x8 Decomposed Structure | Power-Delay Product (× 10^-12 Joules), Without Decomposition | Power-Delay Product (× 10^-12 Joules), Using 8x8 Dadda | Power-Delay Product (× 10^-12 Joules), Using 8x8 Decomposed Structure | Power-Delay Product Savings with respect to Without Decomposition in %, Using 8x8 Dadda | Power-Delay Product Savings with respect to Without Decomposition in %, Using 8x8 Decomposed Structure |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.8 | 2.696 | 2.774 | 2.547 | 1.71 | 1.41 | 1.54 | 4.61016 | 3.91134 | 3.92238 | 15.15 | 14.91 |
| 1.5 | 0.890 | 0.933 | 0.862 | 2.85 | 2.00 | 2.51 | 2.3585 | 1.866 | 2.16362 | 20.88 | 8.26 |
| 1.2 | 0.533 | 0.569 | 0.516 | 5.46 | 3.18 | 4.14 | 2.9101 | 1.80942 | 2.13624 | 37.82 | 26.59 |
| 1.0 | 0.183 | 0.196 | 0.178 | 8.71 | 5.05 | 6.63 | 1.59393 | 0.9898 | 1.18014 | 35.69 | 25.96 |

Transistor Count

| Without Decomposition | Decomposition Process, Using 8x8 Dadda | Decomposition Process, Using 8x8 Decomposed Structure |
|---|---|---|
| 6762 | 6792 | 7480 |


The simulation results of the 8x8 Dadda multiplier and the 16x16 Dadda multiplier with and without decomposition for power supply variations are summarized in Tables 3.9 and 3.10 respectively. In Table 3.10, the results for two types of decomposition are listed: decomposition based on 8x8 Dadda blocks, and decomposition based on 8x8 Dadda multipliers that are themselves implemented using 4x4 decomposition, termed the 8x8 decomposed structure.

It is observed from Table 3.9 that, for the 8x8 multiplier structure,

the decomposition logic shows an improvement of 22% to 25% in delay

compared to Dadda’s method due to parallel processing of data. The power

dissipation is slightly less than that of the Dadda structure due to reduction in

glitches in spite of the extra logic circuitry. The power-delay product is

reduced by about 25% to 27%. From Table 3.10, it is also inferred that the

delay of the 16x16 Dadda multipliers implemented using decomposition

process is less compared to 16x16 Dadda multiplier implemented without

decomposition. This is because decomposition process incorporates the effect

of parallel processing.
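The power-delay product and savings figures quoted from the tables follow from simple arithmetic, which the short sketch below reproduces for the 1.5 V row of Table 3.9. The helper names `power_delay_product` and `percent_savings` are assumptions introduced for illustration; only the four input values are taken from the table.

```python
def power_delay_product(power, delay):
    """Power-delay product; with power in microwatts and delay in
    nanoseconds the result is in femtojoules (10^-15 J)."""
    return power * delay


def percent_savings(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in %."""
    return 100.0 * (baseline - proposed) / baseline


# Table 3.9, supply voltage 1.5 V (power in µW, delay in ns).
pdp_decomposed = power_delay_product(184, 1.45)   # using 4x4 decomposition
pdp_plain = power_delay_product(189, 1.92)        # without decomposition
savings = percent_savings(pdp_plain, pdp_decomposed)

print(round(pdp_decomposed, 2), round(pdp_plain, 2), round(savings, 2))
```

Rounded to two decimals, the three printed values agree with the power-delay products and the power-delay product savings listed for 1.5 V in Table 3.9.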

The 16x16 Dadda multiplier implemented by decomposition using 8x8 Dadda partitioned blocks is faster than the other decomposed structure. This is because the 8x8 decomposed structure requires more stages of computation after parallel processing to reach the final result. About 17% to 42% improvement in speed is observed for the 16x16 Dadda multiplier implemented using decomposition (8x8 Dadda) compared to the 16x16 Dadda multiplier implemented without decomposition. Further, a reduction of about 15% to 35% in power-delay product is achieved for the 16x16 Dadda multiplier implemented using decomposition (8x8 Dadda) compared to the 16x16 Dadda multiplier implemented without decomposition. Similarly, about 8% to 26% improvement in power-delay product is obtained for the 16x16 Dadda multiplier implemented using decomposition (8x8 Decomposed Structure) compared to the 16x16 Dadda multiplier implemented without decomposition.

The simulation results for the two pipelined Dadda multiplier

structures are shown in Table 3.11. It can be observed that the latched CPL

adder reduces the overhead for pipelined structures compared to the use of

separate latches for pipelined multiplier design.

Table 3.11 Simulation results for 8x8 Pipelined Dadda Multiplier structures

Power Results

| Supply Voltage (V) | Latched CPL Adder (µW) | PowerPC Latch (µW) | Savings % |
|---|---|---|---|
| 1.8 | 652 | 720 | 9.444 |
| 1.5 | 422 | 470 | 10.21 |
| 1.2 | 125 | 142 | 11.97 |
| 1.0 | 85.6 | 98.5 | 13.09 |

Transistor Count (No. of Transistors)

| Latched CPL Adder | PowerPC Latch | Savings % |
|---|---|---|
| 1840 | 1976 | 6.882 |

3.8.2 Simulation Results for Bypassing Algorithm

The average power consumed, worst case delay and power-delay product for 4x4 and 16x16 Wallace multipliers with and without bypassing of partial products are listed in Tables 3.12 and 3.13 respectively.


Table 3.12 Simulation Results for 4x4 Wallace Multiplier with and without Bypassing

Logic Family Used for Adder Implementation: CMOS, CPL, Hybrid XOR

| Parameter | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Average Power Consumed (mW) | 0.523 | 0.48 | 1.57 | 1.25 | 2.94 | 2.66 |
| Delay (ns) | 0.65 | 0.39 | 0.56 | 0.27 | 0.25 | 0.18 |
| Power-Delay Product (pico Joules) | 0.339 | 0.187 | 0.879 | 0.337 | 0.735 | 0.478 |

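Table 3.13 Simulation Results for 16x16 Wallace Multiplier with and without Bypassing

| Parameter | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Average Power Consumed (mW) | 1.943 | 1.802 | 5.32 | 4.22 | 9.231 | 8.78 |
| Delay (ns) | 3.3 | 2.75 | 2.4 | 1.6 | 1.1 | 0.9 |
| Power-Delay Product (pico Joules) | 6.419 | 4.995 | 12.768 | 6.752 | 10.16 | 7.902 |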

Figures 3.21 and 3.22 show the power-delay product comparison of 4x4 and 16x16 Wallace tree multipliers with and without bypassing respectively.


Figure 3.21 Power-Delay Product Comparison of a 4x4 Wallace

Multiplier with and without Bypassing

Figure 3.22 Power-Delay Product Comparison of a 16x16 Wallace

Multiplier with and without Bypassing

From Figures 3.21 and 3.22, it is observed that a large improvement in power-delay product occurs when the proposed bypassed Wallace architectures are implemented using the CPL logic family. The Hybrid XOR family possesses the least delay for the proposed bypassed Wallace architectures. The average power consumed, worst case delay and power-delay product for 4x4 and 16x16 Dadda multipliers with and without bypassing of partial products are listed in Tables 3.14 and 3.15 respectively.

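Table 3.14 Simulation Results for 4x4 Dadda Multiplier with and without Bypassing

| Parameter | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Delay (ns) | 0.57 | 0.42 | 0.44 | 0.22 | 0.36 | 0.18 |
| Power-Delay Product | 0.433 | 0.273 | 0.7348 | 0.288 | 0.918 | 0.433 |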

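Table 3.15 Simulation Results for 16x16 Dadda Multiplier with and without Bypassing

| Parameter | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Delay (ns) | 1.76 | 1.49 | 1.52 | 1.43 | 1.1 | 0.9 |
| Power-Delay Product | 6.688 | 5.22 | 7.736 | 6.034 | 10.021 | 7.902 |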


Figures 3.23 and 3.24 show the power-delay product comparison

for 4x4 and 16x16 Dadda multipliers with and without bypassing

respectively.

Figure 3.23 Power-Delay Product Comparison of a 4x4 Dadda

Multiplier with and without Bypassing

Figure 3.24 Power-Delay Product Comparison of a 16x16 Dadda

Multiplier with and without Bypassing

Tables 3.14 and 3.15 reveal that the CPL family offers the largest improvement in power-delay product, while the Hybrid XOR family offers the least delay for the proposed Dadda multiplier architectures. The bypassing technique is most effective when the input operand bits contain more zeros than ones. This is because, when more operand bits have the value logic ‘0’, more (3,2) and (2,2) counters operate in bypass mode. Delay and power are both reduced because the bypassed computation units are deactivated, which yields a large reduction in power-delay product.
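The dependence of bypassing on the number of zero operand bits can be illustrated with a rough proxy. The exact bypass condition of each (3,2) and (2,2) counter is not restated in this section, so the sketch below only assumes that a counter can be bypassed when the partial-product bit feeding it is zero, and it reports the fraction of zero partial-product bits. The function name and this simplifying assumption are illustrative and are not the simulated circuit.

```python
def zero_partial_product_fraction(a, b, n):
    """Fraction of the n*n partial-product bits (a_i AND b_j) that are zero.

    Used here only as a proxy for how many counters could run in bypass
    mode, under the simplifying assumption stated in the text above.
    """
    zeros = 0
    for i in range(n):
        a_i = (a >> i) & 1
        for j in range(n):
            b_j = (b >> j) & 1
            if not (a_i and b_j):   # the partial-product bit is 0
                zeros += 1
    return zeros / (n * n)


# Sparse operands leave most partial-product bits at zero,
# while all-ones operands leave none.
print(zero_partial_product_fraction(0b0001, 0b0001, 4))
print(zero_partial_product_fraction(0b1111, 0b1111, 4))
```

Under this proxy, operands with few ones give a fraction close to 1, while all-ones operands give 0, which is consistent with the observation that bypassing is most effective when zeros dominate.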

Table 3.16 lists the comparison of the Proposed 8x8 Carry-Save

multiplier using 4x4 decomposition algorithm with recent related work in the

literature.

Table 3.16 Performance Comparison of the Proposed 8x8 Carry-Save Multiplier with Related Work in the Literature

CPL Adder used for Implementing all the Multipliers in 180 nm Technology with Supply Voltage 1.8 V

| Parameter | 8x8 Carry-Save Multiplier by Senthilpari et al (2007) | 8x8 Multiplier Proposed by Rizwan Mudassir and Abid (2005), Regular Array Multiplier | 8x8 Multiplier Proposed by Rizwan Mudassir and Abid (2005), Architecture-I | 8x8 Multiplier Proposed by Rizwan Mudassir and Abid (2005), Architecture-II | Proposed 8x8 Carry-Save Multiplier using 4x4 Decomposition |
|---|---|---|---|---|---|
| Power (mW) | - | 2.63 | 1.842 | 1.416 | 1.069 |
| Delay (ns) | 1.4138 | 1.298 | 1.167 | 1.159 | 1 |

From Table 3.16 it is evident that the Proposed 8x8 Carry-Save multiplier achieves about 24.48% reduction in power compared to the 8x8 Carry-Save multiplier proposed by Senthilpari et al (2005). The Proposed 8x8 Carry-Save multiplier implemented using 4x4 decomposition offers about 22.95%, 14.31% and 13.71% improvement in speed compared to the regular array multiplier, Architecture-I and Architecture-II respectively. Moreover, about 59.39%, 41.96% and 24.5% power savings are achieved for the Proposed 8x8 Carry-Save multiplier compared to the regular array multiplier, Architecture-I and Architecture-II respectively.

Tables 3.17 and 3.18 list out the comparison of the Proposed

decomposition based Wallace and Dadda multipliers with related work in the

literature.

Table 3.17 Comparison of the Proposed Wallace Multipliers with Previous Work in the Literature

All the Multipliers in 180 nm Technology with Supply Voltage 1.8 V

| Parameter | 8x8 Wallace Multiplier Proposed by Andrea et al (2001) | Proposed 8x8 Wallace Multiplier using 4x4 Decomposition with CPL adder |
|---|---|---|
| Delay (ns) | 1.6 | 0.98 |

All the Multipliers in 180 nm Technology with Supply Voltage 1.8 V

| Parameter | 16x16 Reconfigurable Multiplier Proposed by Wei Li et al (2007), Design A | 16x16 Reconfigurable Multiplier Proposed by Wei Li et al (2007), Design B | 16x16 Wallace Multiplier Proposed by Andrea et al (2001) | Proposed 16x16 Wallace Multiplier using 8x8 Decomposition with CPL adder |
|---|---|---|---|---|
| Delay (ns) | 3.5 | 2.8 | 2 | 1.64 |

From Table 3.17 it is inferred that about 38.75% improvement in

speed is achieved for the proposed 8x8 multiplier compared to 8x8 Wallace

multiplier proposed by Andrea et al (2001). Further about 18% and 41.4%

improvement in critical path delay is observed for proposed 16x16 multiplier

compared to 16x16 Wallace multipliers proposed by Andrea et al (2001) and

Design B of Wei Li et al (2007) respectively.


Table 3.18 Comparison of the Proposed Dadda Multiplier with Previous work in the Literature

Multiplier Implementations in 180 nm Technology

| Parameter | 8x8 Dadda Multiplier Proposed by Andrea et al (2001) | Proposed 8x8 Dadda Multiplier | % Improvement |
|---|---|---|---|
| Delay (ns) | 1.6 | 1.12 | 30 |

| Parameter | 16x16 Dadda Multiplier Proposed by Andrea et al (2001) | Proposed 16x16 Dadda Multiplier using 8x8 Dadda | % Improvement |
|---|---|---|---|
| Delay (ns) | 1.9 | 1.41 | 25.78 |

3.9 CONCLUSION

A new technique of implementing digital multipliers using decomposition logic is proposed. The 8x8 Carry-Save multiplier implemented using the decomposition technique offers 14.3% improvement in speed for the CMOS logic family, 7.22% for the Hybrid XOR logic family and 36.34% for the CPL logic family. The power-delay product of the 8x8 decomposed Carry-Save multiplier is 19.11% lower for the CMOS logic family, 33.88% lower for the CPL logic family and 28.32% lower for the Hybrid XOR logic family. This reduction in power-delay product may be useful in areas such as embedded systems and signal processing. The 16x16 multipliers implemented using 4x4 decomposed structures introduce excess delay in CMOS logic when compared to other families, because there are more transistors in the critical path.

When compared to the Carry-Save, Wallace and Dadda multipliers, the proposed multipliers are faster and more energy efficient in spite of the extra logic circuitry. It is clearly observed that N×N multipliers decomposed and implemented using (N/2)×(N/2) multiplier blocks perform well in terms of both power and delay. Multipliers implemented using (N/4)×(N/4) decomposed blocks introduce a significant amount of latency and power consumption, owing to the large number of stages needed to combine the intermediate results into the final result. A pipelined implementation of the decomposition multiplier structure has been presented, using a new concept of adders having latched outputs, which reduces the overhead costs in pipelined implementations. The proposed bypassing technique for Wallace and Dadda multipliers yields a significant amount of power savings.
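The link between the number of intermediate rows and the number of compression stages can be made concrete with the following sketch. It assumes idealized row-count models that are standard for these trees but are not taken from this chapter: a Wallace-style tree in which each stage turns every group of three rows into two, and the Dadda height sequence 2, 3, 4, 6, 9, 13, and so on. The function names are hypothetical.

```python
def wallace_stages(rows):
    """Stages needed to reduce `rows` rows to 2 when every stage replaces
    each full group of three rows by two (idealized row-count model)."""
    stages = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3
        stages += 1
    return stages


def dadda_stages(rows):
    """Stages needed by a Dadda tree, using the height sequence
    d_1 = 2, d_(j+1) = floor(1.5 * d_j)."""
    heights = [2]
    while heights[-1] < rows:
        heights.append(heights[-1] * 3 // 2)
    return len(heights) - 1


for rows in (4, 16):
    print(rows, wallace_stages(rows), dadda_stages(rows))
```

Under these assumptions, four rows need two stages and sixteen rows need six stages in either tree, which illustrates why the larger number of intermediate rows produced by smaller blocks lengthens the critical path.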

The proposed bypassing algorithm for a Wallace tree multiplier provides about 7% to 8% power savings for the CMOS logic family, 18% to 20% for the CPL logic family and about 4% to 9% for the Hybrid XOR logic family. The proposed bypassing algorithm for a Dadda tree multiplier achieves about 8% to 14.5% power savings for the CMOS logic family, 17% to 21.5% for the CPL logic family and about 3% to 5% for the Hybrid XOR logic family. The results reveal that the CPL logic family yields the largest reduction in power-delay product, nearly 50% for bypassed structures, when compared to the CMOS and Hybrid XOR families. It is also possible to achieve 20% to 50% improvement in speed. This large improvement in power and delay is achieved at the expense of a small increase in area, accounted for by the buffers and multiplexers inserted in the design for each full adder and half adder. The bypassing algorithm is more effective when the operand bits contain more zeros than ones.
