CHAPTER 3
HIGH SPEED MULTIPLIER DESIGN
3.1 INTRODUCTION
Digital Signal Processors (DSPs) and application specific integrated
circuits rely on the efficient implementation of arithmetic circuits to execute
dedicated algorithms such as convolution, correlation and filtering. In this
chapter, two new techniques are proposed to implement multiplier circuits.
The first technique uses decomposition logic and improves the overall
power-delay product of the multiplier. The decomposition algorithm is tested
on Carry-Save, Wallace and Dadda multiplier circuits, and a guideline for
choosing the appropriate decomposition structure for larger multipliers is
provided. To pipeline the Dadda multiplier effectively, a new adder structure
with latched outputs is proposed. The proposed latched adder reduces the
overheads of implementing pipeline structures for a Dadda multiplier circuit.
The second proposed technique uses a bypassing algorithm. Circuits
employing the bypassing algorithm dissipate less power, since the algorithm
reduces their spurious switching activity. The bypassing algorithm is tested
on Wallace and Dadda multiplier circuits.
All the multiplier circuits are implemented and tested using logic
families such as CMOS, CPL and Hybrid XOR.
3.2 RELATED BACKGROUND
Various types of multipliers are discussed in the literature to
achieve power and performance optimization. Column compression
architecture for fast multiplication proposed by Wallace (1964) offers a total
delay which is proportional to the logarithm of the operand word length of the
multiplier. These column compression multipliers are faster than array
multipliers, because delay in an array multiplier varies linearly with the
operand word length. A unique placement strategy for reducing the stage
counters in column compression architecture was proposed by Dadda (1965).
To achieve a compact layout, Goto et al (1992) have proposed a regularly
structured tree multiplier with recurring blocks. Different methods for
compressing the bits in a Wallace tree to achieve improved column
compression are proposed by Oklobdzija and Villeger (1995). Itoh et al
(2001) have proposed a rectangular styled tree multiplier by folding to
achieve a compact layout at the expense of more complicated interconnects.
Wallace multipliers have slightly more area and approximately the same
worst case delay as that of Dadda multipliers in deep submicron technologies
(Andrea Bickerstaff et al, 2001). Wallace tree multipliers based on adiabatic
4-2 compressors proposed by Xien Ye et al (2005) achieve a considerable
amount of energy savings but with increased latency.
To reduce the spurious switching in array multiplier designs, several
techniques have been proposed. Parhami (2000) introduced Carry-Save adders to
minimize the spurious switching. Huang and Ercegovac (2005) have proposed
a method for balancing the signals to the adder at intermediate stages, by
analyzing the signal delay at each stage and thereby deciding the best
connection for the successive stage. Chong et al (2005) have proposed
introducing a latch in adders. Row bypassing techniques for array multiplier
energy reduction proposed by Obhan et al (2002) eliminate the addition of
zero partial products at the expense of additional shifters and detection
circuitry to align the partial results for accumulation. The column bypassing
technique proposed by Wen et al (2005) has a circuit overhead of one
multiplexer per adder cell. Parallel multipliers based on low power adders were
proposed by Rizwan Mudassir and Abid (2005), which offer significant
improvement in speed and power dissipation compared to standard array
multiplier architectures. Senthilpari et al (2007) have analyzed the
performance of 8x8 Carry-Save multipliers using non clocked pass gate
families. High speed Reconfigurable multiplier was proposed by Wei Li et al
(2007) to achieve high performance in block cipher algorithms. Hwang et al
(2007) have proposed enhanced row bypassing schemes for array multipliers
by implementing the multiplexing mechanism using Clocked CMOS
circuitry, in order to resolve the DC power dissipation problem due to voltage
loss in gated signals. Low energy Booth leap frog array multiplier using
dynamic adders is proposed by Chong et al (2007) for low energy and IC area
critical applications. Tzu-Yuan Kuo and Jinn-Shyan Wang (2008) have
proposed a new low voltage latch adder based Wallace tree multiplier. The
multipliers proposed in the existing literature do not offer sufficient
parallelism to minimize the glitch power.
3.3 CONVENTIONAL MULTIPLIERS
Wallace tree, Dadda and Carry-Save multipliers are commonly
used architectures for performing multiplication. The Carry-Save multiplier is
derived from an array multiplier. Tree structured multipliers like Wallace and
Dadda fare well, despite their irregularity and excess wiring. This is because
tree multipliers offer a smaller depth of partial product reduction
hardware, which in turn seems to offset the power loss in wiring. This helps
to reduce the overall power dissipation and delay.
3.3.1 Carry-Save Multiplier
Figure 3.1 shows a 4x4 multiplier implemented using the Carry-Save
method. Unlike the normal array multiplier, the Carry-Save multiplier
propagates the output carry bits diagonally downwards instead of to the right.
This design requires an extra adder, called the vector-merging adder, to
produce the final result. The multiplier is so named because the carry bits of
each stage are saved and passed to the next adder row rather than being
propagated sideways immediately. For this multiplier the number of
transistors increases, and hence the area occupied also increases.
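The carry-save idea described above can be illustrated with a small word-level model. This is only a sketch under stated assumptions: the function name `carry_save_multiply` and the word-level treatment (one row of full adders per partial product) are illustrative and do not reproduce the exact half adder and full adder placement of Figure 3.1.

```python
def carry_save_multiply(x, y, n=4):
    """Multiply two unsigned n-bit integers using carry-save accumulation.

    Hypothetical word-level model: each loop iteration stands for one row
    of full adders whose carries are saved for the next row.
    """
    sum_word, carry_word = 0, 0
    for i in range(n):
        # Row i of the AND array: x gated by bit i of y, shifted into place.
        partial = (x << i) if (y >> i) & 1 else 0
        # Bitwise full adders: no carry ripples sideways within the row.
        new_sum = sum_word ^ carry_word ^ partial
        new_carry = ((sum_word & carry_word)
                     | (sum_word & partial)
                     | (carry_word & partial)) << 1
        sum_word, carry_word = new_sum, new_carry
    # Vector-merging adder: the only carry-propagating addition.
    return sum_word + carry_word


print(carry_save_multiply(13, 11))  # 143
```

The saved carry word is shifted left by one position in every row, which corresponds to the diagonal carry propagation described in the text.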
[Figure 3.1 schematic: rows of half adders (HA) and full adders (FA) feeding a Vector Merging Adder]
Figure 3.1 4x4 Carry-Save Multiplier
3.3.2 Wallace Multiplier
Figure 3.2 shows an 8x8 multiplier proposed by Wallace (1964).
Wallace method employs a three-step process to multiply two numbers:
Step 1 : Generate all partial products in parallel using an AND gate array.
Step 2 : If there are ‘p’ rows of partial products, then combine the partial
products such that $3\lfloor p/3 \rfloor$ rows of partial products are grouped in
the present stage and the remaining $p \bmod 3$ rows are passed to the
next stage. This step is repeated until the matrix height is reduced to two rows.
Step 3 : Sum the two resulting rows with a fast carry propagate adder to
produce the final product.
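The three steps above can be sketched as a word-level model. The function name `wallace_multiply` is an assumption made for illustration, and each group of three rows is compressed with bitwise full adders, so the sketch models the row grouping rather than the gate-level wiring of Figure 3.2.

```python
def wallace_multiply(x, y, n=8):
    """Return (product, reduction_stages) for unsigned n-bit operands.

    Hypothetical word-level sketch of the three-step Wallace process.
    """
    # Step 1: generate all partial product rows with an AND array.
    rows = [(x << i) if (y >> i) & 1 else 0 for i in range(n)]
    stages = 0
    # Step 2: compress groups of three rows into two until two rows remain.
    while len(rows) > 2:
        next_rows = []
        full_groups = len(rows) // 3
        for g in range(full_groups):
            a, b, c = rows[3 * g: 3 * g + 3]
            next_rows.append(a ^ b ^ c)                           # sum row
            next_rows.append(((a & b) | (a & c) | (b & c)) << 1)  # carry row
        # Rows outside a complete group of three pass through unchanged.
        next_rows.extend(rows[3 * full_groups:])
        rows = next_rows
        stages += 1
    # Step 3: final carry propagate addition of the remaining rows.
    return sum(rows), stages


print(wallace_multiply(255, 255))  # (65025, 4)
```

For 8-bit operands the row count falls from 8 to 6, 4, 3 and finally 2, which gives the four reduction stages reported by the sketch.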
[Figure 3.2 schematic: 8x8 Wallace reduction tree with inputs X0 to X7 and Y0 to Y7 and product bits Z0 to Z15]
Figure 3.2 Wallace Multiplier for 8x8 Multiplication
Wallace (1964) showed that the delay for an NxN multiplier can be
reduced to the order of $\log_2 N$, making it faster than an array multiplier.
Minimizing the delay
of the bit-product reduction process is very important in reducing the overall
multiplication time. In Wallace tree multipliers, rows are grouped into sets of
three during each reduction stage. This is done to accommodate a pseudo-adder
(a row of full adders with no carry chain), which adds three operand bits
to produce a two-bit intermediate result with only a single full adder delay.
Within each three row set, (3,2) counters reduce columns with three bits to
two bits and (2,2) counters reduce columns with two bits to two bits. Rows
that are not part of a three row set are transferred to the next stage without
modification. The height of the matrix in the $j$-th reduction stage, $w_j$, is
defined by the following recursive equations (3.1) and (3.2):
$w_0 = N$ (3.1)
$w_{j+1} = 2 \lfloor w_j / 3 \rfloor + (w_j \bmod 3)$ (3.2)
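As a quick check of equations (3.1) and (3.2), the recursion can be evaluated directly. The helper name `wallace_heights` is hypothetical and introduced only for this illustration.

```python
def wallace_heights(n):
    """List the matrix heights w_0, w_1, ... until two rows remain."""
    heights = [n]                        # w_0 = N, equation (3.1)
    while heights[-1] > 2:
        w = heights[-1]
        heights.append(2 * (w // 3) + w % 3)   # equation (3.2)
    return heights


print(wallace_heights(8))   # [8, 6, 4, 3, 2]
print(wallace_heights(16))  # [16, 11, 8, 6, 4, 3, 2]
```

The number of reduction stages is one less than the length of the list, so an 8x8 Wallace multiplier needs four stages and a 16x16 multiplier needs six.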
3.3.3 Dadda Multiplier
Dadda (1965) generalized and extended Wallace’s results by noting
that a full adder can be thought of as a circuit, which counts the number of
ones in the input and outputs that number in 2-bit binary form. Using such a
counter, Dadda postulated that, at each stage, only minimum amount of
reduction should be done in order to reduce the partial product matrix by a
factor of 1.5. Dadda’s method requires the same number of levels as that of
Wallace method. However Dadda’s method does the minimum reduction
necessary at each level. This results in a design with fewer full adders and
half adders. The number of (3,2) and (2,2) counters required is minimized in
Dadda’s technique compared to Wallace tree. The disadvantage of Dadda’s
method is that it requires a slightly wider fast Carry Propagate Adder (CPA)
and has a less regular structure than Wallace. Figure 3.3 shows an 8x8 Dadda
multiplier.
Figure 3.3 Dadda Multiplier for 8x8 Multiplication
The reduction process for a Dadda multiplier is developed using the
following recursive algorithm:
Step 1 : Let $d_1 = 2$ and $d_{j+1} = \lfloor 1.5\, d_j \rfloor$, where $d_j$ is the matrix height for the
$j$-th stage from the end. Find the largest $j$ such that at least one
column of the original partial product matrix has more than $d_j$ bits.
Step 2 : In the $j$-th stage from the end, employ (3,2) and (2,2) counters to
obtain a reduced matrix with no more than $d_j$ bits in any column.
Step 3 : Let $j = j - 1$ and repeat step 2 until a matrix with only two rows is
generated.
For an $N \times N$ bit Dadda multiplier, there are $N^2$ bits in the original
partial product matrix and $4N - 3$ bits in the final two row matrix. The total
number of (3,2) counters required is given by $N^2 - 4N + 3$. The length of the
carry propagate adder (CPA) is given by $2N - 2$. The total number of (2,2)
counters required is given by $N - 1$.
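The reduction steps and the counter formulas above can be cross-checked with a small column-height simulation. This is a sketch under assumptions: the names `dadda_targets` and `dadda_counts` are hypothetical, and the model tracks only the number of bits per column, not the actual wiring.

```python
def dadda_targets(n):
    """Stage heights d_j below n, listed from the first stage to the last."""
    d = [2]
    while d[-1] * 3 // 2 < n:          # d_{j+1} = floor(1.5 * d_j)
        d.append(d[-1] * 3 // 2)
    return d[::-1]


def dadda_counts(n):
    """Return (full_adders, half_adders, cpa_length, final_bits)."""
    # Column heights of the n x n partial product matrix, LSB first.
    heights = [min(k + 1, 2 * n - 1 - k) for k in range(2 * n - 1)]
    full_adders = half_adders = 0
    for target in dadda_targets(n):
        carry_in, reduced = 0, []
        for h in heights:
            total, carry_out = h + carry_in, 0
            # Do only the minimum reduction needed to reach the target.
            while total > target:
                if total - target >= 2:
                    full_adders += 1   # (3,2) counter removes two bits
                    total -= 2
                else:
                    half_adders += 1   # (2,2) counter removes one bit
                    total -= 1
                carry_out += 1         # one carry goes to the next column
            reduced.append(total)
            carry_in = carry_out
        heights = reduced
    cpa_length = sum(1 for h in heights if h == 2)
    return full_adders, half_adders, cpa_length, sum(heights)


print(dadda_targets(8))  # [6, 4, 3, 2]
print(dadda_counts(8))   # (35, 7, 14, 29)
```

For N = 8 the simulation agrees with the closed forms: 64 - 32 + 3 = 35 (3,2) counters, 8 - 1 = 7 (2,2) counters, a CPA of length 14 and 29 bits in the final two row matrix.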
3.4 LOGIC FAMILIES CONSIDERED FOR DESIGNING
MULTIPLIERS
Three Logic families namely Complementary Metal Oxide
Semiconductor (CMOS), Complementary Pass-transistor Logic (CPL) and
Hybrid XOR have been chosen for implementing the various multipliers.
3.4.1 CMOS Logic
CMOS logic supports an efficient implementation of the design
with high noise margin and low static power consumption. Output logic level
does not depend on the transistor size and it is a ratioless logic. The rise times
and fall times of the inputs are controlled, tending to be ramps rather than step
functions. The disadvantage lies in the large PMOS transistors which result in
high input and internal capacitances. Also, area requirements are large, i.e., if
there are N inputs, 2N transistors (N each for PUN and PDN) are needed.
Moreover, a weak output driving capability is caused by series transistors.
The basic CMOS full adder implementation is shown in Figure 3.4.
[Figure 3.4 schematic: static mirror adder with inputs A, B and Ci, supply VDD, and outputs CO and SO]
Figure 3.4 Full Adder using CMOS Logic (Static Mirror Adder)
3.4.2 Complementary Pass-Transistor Logic (CPL)
Figure 3.5 shows a Full adder implemented using CPL. CPL
benefits from the small input capacitances (NMOS network only), fast
differential stage, reduced number of transistors (N transistors for N inputs)
and good output driving capability, making the implementation of complex
gates very efficient. Usually, level restorer circuits (PMOS transistors) are
necessary for swing restoration. The switching speed is very high because of
low threshold voltage. However, this leads to difficulty in switching off these
zero threshold devices. Also, the large number of nodes and transistors and
the two inversion levels result in inefficiency. It has larger short-circuit
currents, higher wiring overhead and increased power consumption compared
to CMOS. The CPL adder shown in Figure 3.5 requires seven inverters to
generate the complement signals. However, when this adder is used in designs
such as multiplier, the input complementary signals can be derived from the
previous stage outputs. This reduces the transistor count. Also the drivability
of the adder is fairly good even without the use of inverters at the output. This
is due to the presence of PMOS pull up transistors. Therefore in complex
designs such as multipliers, the output inverters for generating sum and carry
can be used in the alternate stages of the design, thereby improving speed and
reducing area.
[Figure 3.5 schematic: CPL network with inputs A, B and C, their complements Ab, Bb and Cb, intermediate signals F and Fb, and outputs S, Sb, Co and Cob]
Figure 3.5 Full Adder using CPL
In the proposed work, the inverters and half adders used in the CPL
implementation were also designed using CMOS logic to yield better output
results. CPL causes reduced voltage swing. Hence buffers have been inserted
to restore the logic levels wherever necessary. This has led to slightly higher
power dissipation.
3.4.3 Hybrid XOR Logic
Hybrid XOR proposed by Chang et al (2005) is the combination of
various logic styles for achieving an optimized structure. It comprises
CPL, transmission gate (TG) logic and CMOS logic. The adder implemented
in this logic style suits tree structured arithmetic units. The schematic is
shown in Figure 3.6.
[Figure 3.6 schematic: inputs A, B and C, outputs S and Co]
Figure 3.6 Full Adder using Hybrid XOR Logic
It is designed by optimizing three modules. The first module
generates the XOR and XNOR logic signals of the adder inputs. The sum and
carry are then generated using these signals. The structure is balanced as the
sum and carry outputs are generated simultaneously.
3.5 PROPOSED DECOMPOSITION ALGORITHM
In this thesis, a new technique to implement digital multipliers
using decomposition logic is presented. Here, the multiplication process is
decomposed into smaller sub-units (smaller multipliers) and their outputs are
combined to get the final result. By doing so, parallel processing is
introduced in addition to the benefits from the structured implementation of
the multiplier. In the first stage, the $N \times N$ partial products are split up into
four $N/2 \times N/2$ multiplier blocks. The outputs from these blocks are then combined
in a tree like fashion to get the final result. The $N/2 \times N/2$ multiplier is implemented
using a Carry-Save structure for decomposed Carry-Save multipliers, a
Wallace structure for decomposed Wallace multipliers, and a Dadda
structure for decomposed Dadda multipliers. The decomposition logic requires
extra circuitry to perform the final addition of the outputs obtained from the
$N/2 \times N/2$ multipliers. However, due to parallel processing of the
$N/2 \times N/2$ multipliers,
significant improvement in speed is achieved. Since the inputs to the final
adder circuitry arrive in parallel, glitches are reduced, resulting in less power
dissipation. The decomposition process is further extended with $N/4 \times N/4$
multiplier blocks. If the decomposition process is progressively extended, the
benefits derived from parallel processing of data are
outweighed by the degradation due to extra logic circuitry. This occurs because
a large number of
intermediate result bits are generated in each column. These intermediate
result bits need to be combined in a tree like fashion by grouping them into
sets of three and sets of two. Hence several levels of reduction are required
before the number of rows is reduced to two.
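The arithmetic behind the first level of decomposition can be sketched as follows. The function name `decomposed_multiply` and the use of Python's built-in multiplication for the four sub-blocks are assumptions for illustration; in the proposed hardware each sub-block would be a Carry-Save, Wallace or Dadda multiplier and the final additions would be done by an adder tree.

```python
def decomposed_multiply(a, b, n=8):
    """Multiply two unsigned n-bit integers using four (n/2 x n/2) blocks."""
    half = n // 2
    mask = (1 << half) - 1
    a_lo, a_hi = a & mask, a >> half
    b_lo, b_hi = b & mask, b >> half
    # Stage 1: four independent sub-products that can be computed in parallel.
    ll = a_lo * b_lo
    lh = a_lo * b_hi
    hl = a_hi * b_lo
    hh = a_hi * b_hi
    # Later stages: align each n-bit block output by its weight and add.
    return ll + ((lh + hl) << half) + (hh << n)


print(decomposed_multiply(0xAB, 0xCD) == 0xAB * 0xCD)  # True
```

Each sub-product is n bits wide, so after alignment no column of the intermediate matrix holds more than three bits, which is why a single compression stage followed by a fast adder suffices at this level.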
Figure 3.7 shows the dot diagram of a decomposed 8x8 multiplier
structure. An 8x8 multiplier has eight rows of partial products, each row having
eight terms. In the first stage, the partial products are grouped into four 4x4
multiplier blocks as shown in Figure 3.7. Each 4x4 multiplier block
generates 8 bits of output.
Figure 3.7 Decomposition Structure for 8x8 Multiplication
Hence, at the start of the second stage, the partial product matrix of the first stage has been reduced to four rows, each row containing 8 bits. In the second stage, the columns containing a single bit of information are transferred to the third stage as such, and columns containing three bits of information are compressed to two bits, namely sum and carry, using a full adder and then transferred to the third stage. The columns transferred as such and the sum bits of the second stage full adders are laid out as the first row in the third stage. The carry bits of the full adders are shifted to the next column and arranged as the second row in the third stage. The second row dots are connected with the first row dots by a diagonal line, indicating that they are the carry bits generated by the full adders in the second stage. In the third stage, the two rows are compressed using a fast adder to generate the final result.
Figure 3.8 shows the dot diagram of a decomposed 16x16 multiplier structure using 8x8 multipliers. In the first stage, four 8x8 multiplier blocks are used to combine all the partial products. The outputs from these 8x8 multipliers are then combined in a tree like fashion, in a similar manner, to produce the final result.
[Figure 3.8 dot diagram: Stage 1 built from 8x8 Multiplier blocks, followed by Stage 2]
Figure 3.8 Decomposition Structure for 16x16 Multiplication using 8x8
Multipliers
Figure 3.9 represents the dot diagram of the decomposed 16x16
multiplication process using 4x4 multipliers. It is observed that more
addition stages are required to generate the final result. This results in excess
hardware, which may account for increased delay and power consumption
when compared to a 16x16 multiplication performed using 8x8
decomposition.
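The growth in addition stages can be made concrete by counting how many block output bits land in each column. This is a rough sketch under assumptions: the helper names are hypothetical, and the stage estimate applies the Wallace height recursion (3.2) to the tallest column while ignoring carries that arrive from neighbouring columns.

```python
def intermediate_column_heights(n, block):
    """Bits per column after stage 1 of an n x n multiply built from
    (block x block) sub-multipliers, each producing 2*block output bits."""
    blocks_per_side = n // block
    heights = [0] * (2 * n)
    for i in range(blocks_per_side):
        for j in range(blocks_per_side):
            offset = (i + j) * block          # weight of this block output
            for col in range(offset, offset + 2 * block):
                heights[col] += 1
    return heights


def wallace_stages(height):
    """Approximate number of 3-to-2 reduction stages for a given height."""
    stages = 0
    while height > 2:
        height = 2 * (height // 3) + height % 3
        stages += 1
    return stages


for block in (8, 4):
    tallest = max(intermediate_column_heights(16, block))
    print(block, tallest, wallace_stages(tallest))
```

Under these assumptions the 8x8 decomposition leaves at most three bits per column (one reduction stage), whereas the 4x4 decomposition leaves up to seven bits per column (about four reduction stages), which is consistent with the extra addition stages observed in Figure 3.9.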
Figure 3.9 Decomposition Structure for 16x16 Multiplication using 4x4
Multipliers
3.6 PROPOSED DESIGN OF AN 8X8 PIPELINED DADDA
MULTIPLIER
Pipelined circuits can be constructed by using level sensitive
latches at the output of intermediate stages. Pipelining is a popular design
technique often used to accelerate the operation of data paths in DSPs. Two
pipelined multiplier structures are presented and their performances are
compared. A pipelined multiplier is implemented with the proposed latched
CPL adder and its performance is compared with the pipelined multiplier
structure constructed using a static latch proposed by Uming Ko and Poras
Balsara (2000). This latch is named the ‘PowerPC latch’ and is shown in
Figure 3.10. It uses a transmission gate controlled by a clock signal at the
input. The feedback path consists of an inverter and a transmission gate
combined together to reduce power dissipation.
To reduce the overheads (transistor count and power dissipation) of
implementing pipelined multiplier design, a latched adder is proposed by
modifying the CPL adder. The latched CPL adder is shown in Figure 3.11.
The latch portion of the adder is derived from a two phase CPL flip-flop
structure. The structure is pseudo-static and requires only single phase
clocking as opposed to the two phase clocking required for the PowerPC
latch. The latched version of the CPL adder requires only two extra transistors
when compared to the CPL adder. When the PowerPC latch is used at the
output of the adder, 10 transistors are needed. Hence, 8 transistors are saved
by using the Latched CPL adder as compared to PowerPC latch.
[Figure 3.10 schematic: input D, clock signals CLK and CLKB, outputs Q and Qb]
Figure 3.10 PowerPC Latch
[Figure 3.11 schematic: CPL adder signals A, B, C, F, S and Co with their complements, plus the clock input CLK]
Figure 3.11 Latched CPL Adder
The pipelining technique is applied to an 8x8 Dadda multiplier
implemented using decomposition logic. The pipelined structure is shown
in Figure 3.12. Two structures are designed: one using the latched CPL
adder and the other using the PowerPC latch. The latched CPL adder is used in
the final addition stage of the 4x4 Dadda multiplier, while the PowerPC latch is
used at the outputs of the 4x4 Dadda multiplier. All the other adders used
in the pipelined multiplier are CPL adders without a latch.
Figure 3.12 Pipeline Structure
3.7 PROPOSED BYPASSING ALGORITHM FOR WALLACE AND DADDA MULTIPLIERS
In the proposed Wallace and Dadda multiplier architectures, the first step involves the generation of partial products using an AND array. In the second step the partial products are grouped into sets according to the conventional Wallace and Dadda algorithms respectively. Modification of the architecture is proposed in the implementation of (3,2) and (2,2) counters which are key members of partial product reduction hardware. Figure 3.13 shows the circuit diagram of the tri-state buffer.
[Figure 3.13 schematic: input I/P, output O/P and control signal cs]
Figure 3.13 Tri-state Buffer
The tri-state buffer is constructed using transmission gate logic. The control signal for the tri-state buffer is ‘cs’.
The architecture of the proposed (3,2) counter is shown in Figure 3.14. The proposed (3,2) counter has a full adder, an OR gate, two tri-state buffers and two (2x1) multiplexers. It has three inputs named ‘a’, ‘b’ and ‘ci’ and two outputs named ‘s’ and ‘c’. The input operands ‘a’ and ‘b’ are passed through the tri-state buffers to drive the input terminals of the full adder. The control signal ‘cs’ for the two tri-state buffers and the two (2x1) multiplexers is derived by performing a logic OR operation on the inputs ‘a’ and ‘b’. The control signal is high when at least one of the operand bits is high and low when both operand bits are low. When the control signal ‘cs’ is high, the input operands ‘a’ and ‘b’ propagate through the buffers and stimulate the input terminals of the full adder. The full adder cell is activated and addition of the bits ‘a’, ‘b’ and ‘ci’ is performed. This yields the sum output ‘s1’ and the carry output ‘c1’ of the full adder.
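[Figure 3.14 schematic: Full Adder with outputs s1 and c1, inputs a, b and ci, control signal cs, Vss input, and (2x1) multiplexers I and II driving outputs s and c]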
Figure 3.14 Proposed (3,2) Counter
These outputs ‘s1’ and ‘c1’ then propagate through the (2x1)
multiplexers labeled ‘I’ and ‘II’ respectively to produce the sum output ‘s’
and carry output ‘c’ of the (3,2) counter. When the control signal ‘cs’ is low,
the tri-state buffers are open circuited which in turn causes the full adder cell
to get deactivated. Hence the third input operand ‘ci’ of the proposed (3,2)
counter is directly routed through the (2x1) multiplexer labeled as ‘I’ to
generate the sum output ‘s’ and logic ‘0’ is routed through the (2x1)
multiplexer labeled as ‘II’ to generate the carry output ‘c’ of the (3,2)
counter.
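The behaviour of the proposed (3,2) counter can be captured in a short functional model. This is a sketch only: the function names are hypothetical, and the model describes logic values, not the switching activity or power that the bypassing is intended to reduce.

```python
def full_adder(a, b, ci):
    """Reference full adder returning (sum, carry)."""
    s = a ^ b ^ ci
    c = (a & b) | (a & ci) | (b & ci)
    return s, c


def bypass_counter_3_2(a, b, ci):
    """Functional model of the (3,2) counter with bypassing."""
    cs = a | b                  # control signal from the OR gate
    if cs:
        # Tri-state buffers conduct; the full adder computes s1 and c1,
        # which the multiplexers pass to the outputs.
        s1, c1 = full_adder(a, b, ci)
        return s1, c1
    # Full adder is bypassed: 'ci' goes to the sum, logic 0 to the carry.
    return ci, 0


# The bypassed counter matches a plain full adder for all eight inputs.
print(all(bypass_counter_3_2(a, b, ci) == full_adder(a, b, ci)
          for a in (0, 1) for b in (0, 1) for ci in (0, 1)))  # True
```

The bypass path is valid because, when both ‘a’ and ‘b’ are zero, the sum of the three inputs is simply ‘ci’ and no carry can be generated.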
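[Figure 3.15 schematic: Half Adder with outputs s1 and c1, inputs a and b, control signal cs, Vss input, and (2x1) multiplexers I and II driving outputs s and c]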
Figure 3.15 Proposed (2,2) Counter
The architecture of the proposed (2,2) counter is shown in
Figure 3.15. The proposed (2,2) counter has one half adder, one tri-state
buffer and two multiplexers. It has two input
operands ‘a’ and ‘b’ and two outputs named ‘s’ and ‘c’. The input operand
‘a’ also acts as the control signal ‘cs’ for the tri-state buffer and the two (2x1)
multiplexers. When the control signal ‘cs’ is high, the input operand ‘a’
propagates through the tri-state buffer and stimulates the input terminals of the
half adder cell. The half adder is enabled and it performs addition of the two bits
‘a’ and ‘b’. This results in generation of the sum output ‘s1’ and the carry output ‘c1’
of the half adder. These outputs ‘s1’ and ‘c1’ then propagate through the
(2x1) multiplexers labeled ‘I’ and ‘II’ respectively to produce the sum output
‘s’ and carry output ‘c’ of the (2,2) counter. When the control signal ‘cs’ is
low, the tri-state buffer is open circuited, which in turn deactivates the half
adder cell. Hence the second input operand ‘b’ of the proposed (2,2) counter
is directly routed through the (2x1) multiplexer labeled ‘I’ to generate the
sum output ‘s’, and logic ‘0’ is routed through the (2x1) multiplexer labeled
‘II’ to generate the carry output ‘c’ of the (2,2) counter.
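The (2,2) counter can be modelled in the same way. Again this is only a functional sketch with a hypothetical function name; it checks the logic of the bypass path and says nothing about power.

```python
def bypass_counter_2_2(a, b):
    """Functional model of the (2,2) counter with bypassing."""
    cs = a                      # operand 'a' doubles as the control signal
    if cs:
        # Half adder is enabled and computes s1 = a XOR b, c1 = a AND b.
        return a ^ b, a & b
    # Half adder is bypassed: 'b' goes to the sum, logic 0 to the carry.
    return b, 0


# The bypassed counter matches a plain half adder for all four inputs.
print(all(bypass_counter_2_2(a, b) == (a ^ b, a & b)
          for a in (0, 1) for b in (0, 1)))  # True
```

With ‘a’ equal to zero the half adder would output ‘b’ as the sum and zero as the carry anyway, so routing those values through the multiplexers leaves the result unchanged while the adder inputs stay idle.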
The partial products are reduced in a progressive manner until
only two rows remain. The final step adds the two rows
using a fast carry propagate adder.
3.8 SIMULATION
Simulation for the multiplier designs was done using Tanner EDA
tool. The parameters considered for evaluating the proposed multiplier
structures are power, delay, power-delay product and transistor count.
3.8.1 Simulation Results for Decomposition Algorithm
The proposed decomposition algorithm is tested on Carry-Save,
Wallace and Dadda multipliers.
3.8.1.1 Results of carry-save multipliers
The Carry-Save multiplier circuits were simulated using TSMC
180 nm technology. The threshold voltages of NMOS and PMOS transistors
are kept as 0.39 V and -0.41 V respectively. The supply voltage is set to
1.8 V for all modules with rise and fall times of the input set to 0.10 ns.
Tables 3.1 and 3.2 list a comparative study on Carry-Save multipliers with
and without decomposition for various performance parameters.
Table 3.1 Results for 8x8 Carry-Save Multiplier with and without Decomposition

| Performance Parameters | Without Decomposition: CMOS | Without Decomposition: Hybrid XOR | Without Decomposition: CPL | 4x4 Decomposition: CMOS | 4x4 Decomposition: Hybrid XOR | 4x4 Decomposition: CPL |
| --- | --- | --- | --- | --- | --- | --- |
| Average power (mW) | 0.4504 | 0.62 | 1.024 | 0.4254 | 0.479 | 1.069 |
| Delay (ns) | 1 | 1.8 | 1.571 | 0.857 | 1.67 | 1 |
| Power-Delay Product (pJ) | 0.4504 | 1.116 | 1.6171 | 0.3643 | 0.7999 | 1.069 |
| Transistor count | 2688 | 1888 | 2192 | 3072 | 2880 | 2659 |
Table 3.2 Results for 16x16 Carry-Save Multiplier with and without Decomposition

| Performance Parameters | Without Decomposition: CMOS | Without Decomposition: Hybrid XOR | Without Decomposition: CPL | 8x8 Decomposition: CMOS | 8x8 Decomposition: Hybrid XOR | 8x8 Decomposition: CPL |
| --- | --- | --- | --- | --- | --- | --- |
| Average power (mW) | 3.90 | 8.15 | 8.17 | 3.35 | 7.70 | 7.59 |
| Delay (ns) | 5.192 | 3.327 | 4.091 | 3.076 | 2.702 | 2.692 |
| Power-Delay Product (pJ) | 20.248 | 27.115 | 33.405 | 10.303 | 20.805 | 20.432 |
| No. of Transistors | 13776 | 7892 | 11780 | 11520 | 12800 | 8992 |
From Tables 3.1 and 3.2 it is inferred that Carry-Save multipliers
implemented using decomposition logic have lower delay and power-delay
product than Carry-Save multipliers implemented without
decomposition for all the logic families considered. The average power is
also lower in every case except the 8x8 CPL multiplier (1.069 mW with
decomposition against 1.024 mW without). This is because the decomposed
multiplier blocks compute in parallel and the output signals from the
partitioned blocks have the same arrival times. Since the signals for the next
stage arrive at the same time, glitches are largely eliminated, which in turn
reduces the power dissipation. The transistor count of the 8x8 Carry-Save
multiplier implemented using 4x4 decomposed blocks is higher than that of
the 8x8 Carry-Save multiplier implemented without decomposition (for
example, 3072 against 2688 transistors in CMOS). This indicates that there
is an area overhead.
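The power-delay product figures in the tables follow directly from the power and delay rows, since milliwatts multiplied by nanoseconds give picojoules. The short sketch below recomputes the 16x16 CMOS entries of Table 3.2; the helper names are hypothetical, and small differences from the tabulated values are due to rounding.

```python
def power_delay_product_pj(power_mw, delay_ns):
    """Power-delay product in picojoules (mW x ns = pJ)."""
    return power_mw * delay_ns


def percent_reduction(before, after):
    """Relative improvement of 'after' over 'before', in percent."""
    return 100.0 * (before - after) / before


# 16x16 CMOS Carry-Save multiplier, values taken from Table 3.2.
pdp_without = power_delay_product_pj(3.90, 5.192)   # without decomposition
pdp_with = power_delay_product_pj(3.35, 3.076)      # 8x8 decomposition

print(round(pdp_without, 3), round(pdp_with, 3))
print(round(percent_reduction(pdp_without, pdp_with), 1))
```

For this design point the recomputed power-delay product drops from about 20.25 pJ to about 10.30 pJ, a reduction of roughly 49 percent.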
Figures 3.16 to 3.18 show the comparison graphs of delay, power
and power-delay product for an 8x8 Carry-Save multiplier without
decomposition and an 8x8 Carry-Save multiplier using 4x4 decomposition.
Figure 3.16 Delay Comparison of Carry-Save Multiplier with and
without Decomposition
Figure 3.17 Power Comparison of Carry-Save Multiplier with and
without Decomposition
Figure 3.18 Power-Delay Product Comparison of Carry-Save Multiplier with and without Decomposition
3.8.1.2 Results of Wallace tree multipliers
Wallace Tree multipliers are simulated using TSMC 180 nm technology. For 180 nm technology, the threshold voltages of NMOS and PMOS transistors are kept as 0.39 V and -0.41 V respectively. The supply voltage is set to 1.8 V for all modules with rise and fall times of the input set to 0.10 ns.
The Wallace multipliers with and without decomposition are also simulated and tested for supply voltage variations and technology variations. The length and width specifications for 180 nm technology are as follows: NMOS: L=180 nm and W=270 nm; PMOS: L=180 nm and W=810 nm. The length and width specifications for 130 nm technology are as follows: NMOS: L=130 nm and W=195 nm; PMOS: L=130 nm and W=585 nm. For TSMC 130 nm technology, the threshold voltages of NMOS and PMOS transistors are around 0.332 V and -0.3499 V respectively. The input patterns were switched at a frequency of 50 MHz. The rise and fall times of the input are set to 0.10 ns. Tables 3.3 and 3.4 show a comparative study of Wallace multipliers with and without decomposition for various performance parameters.
Table 3.3 Simulation Results of 8x8 Wallace Multiplier with and without Decomposition

| Performance Parameters | Without Decomposition: CMOS | Without Decomposition: Hybrid XOR | Without Decomposition: CPL | 4x4 Decomposition: CMOS | 4x4 Decomposition: Hybrid XOR | 4x4 Decomposition: CPL |
| --- | --- | --- | --- | --- | --- | --- |
| Average power (mW) | 0.911 | 0.765 | 0.342 | 0.431 | 0.560 | 0.321 |
| Delay (ns) | 1.05 | 1.3 | 1.39 | 1.05 | 1.10 | 0.98 |
| Power-Delay Product (pJ) | 0.956 | 0.994 | 0.475 | 0.452 | 0.616 | 0.314 |
| Transistor count | 2492 | 3102 | 2560 | 2292 | 2902 | 2120 |
Table 3.4 Simulation Results of 16x16 Wallace Multiplier with and without Decomposition

| Performance Parameters | Without Decomposition: CMOS | Without Decomposition: Hybrid XOR | Without Decomposition: CPL | 8x8 Decomposition: CMOS | 8x8 Decomposition: Hybrid XOR | 8x8 Decomposition: CPL | 4x4 Decomposition: CMOS | 4x4 Decomposition: Hybrid XOR | 4x4 Decomposition: CPL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Average power (mW) | 1.943 | 9.231 | 5.32 | 1.275 | 7.801 | 3.424 | 1.515 | 8.799 | 4.168 |
| Delay (ns) | 3.3 | 2.1 | 2.4 | 3.1 | 1.8 | 1.64 | 3.6 | 1.9 | 1.897 |
| Power-Delay Product (pJ) | 6.411 | 19.38 | 12.76 | 3.952 | 14.04 | 5.615 | 5.454 | 16.718 | 7.906 |
| Transistor Count | 11434 | 13214 | 9834 | 10164 | 12854 | 9664 | 11700 | 15126 | 12768 |
From Tables 3.3 and 3.4 it is observed that the power-delay product of the Wallace multipliers implemented using the decomposition process is lower than that of the Wallace multipliers implemented without decomposition for all logic families considered. This is due to the parallel processing in the computation of the partitioned blocks in the first stage. Further, 16x16 multipliers implemented using 8x8 decomposed blocks have a lower power-delay product than 16x16 multipliers implemented using 4x4 decomposed blocks. This is because the decomposed structure using 4x4 partitioned blocks yields sixteen rows of intermediate partial products after the first stage, compared to only four rows for 8x8 partitioned blocks. This increase in the number of rows requires more stages of computation to compress the rows using adders and generate the final result. Hence 16x16 multipliers implemented using 4x4 blocks need more hardware, as indicated directly by the transistor count. The critical path delay and the power consumed also increase for 16x16 Wallace multipliers implemented using 4x4 partitioned blocks. As a generalization, $N \times N$ multipliers implemented using $N/2 \times N/2$ multiplier blocks possess the least power-delay product for all logic families. Figures 3.19 and 3.20 show the power-delay product comparison of 8x8 and 16x16 Wallace tree multipliers with and without decomposition.
Figure 3.19 Power-Delay Product Comparison of 8x8 Wallace Tree
Multiplier with and without Decomposition
Figure 3.20 Power-Delay Product Comparison of a 16x16 Wallace Tree
Multiplier with and without Decomposition
Tables 3.5 to 3.8 list a comparative study on Wallace multipliers
with and without decomposition for various performance parameters based on
supply voltage variations and technology variations.
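Table 3.5 Results of 8x8 Decomposed Multipliers for Supply Voltage Variations in 180 nm Technology
(Simulation results for an 8x8 Wallace multiplier; the proposed decomposition algorithm uses 4x4 Wallace multipliers)

| Performance Parameter | Supply Voltage (V) | Proposed Decomposition: CMOS | Proposed Decomposition: CPL | Proposed Decomposition: Hybrid XOR | Without Decomposition: CMOS | Without Decomposition: CPL | Without Decomposition: Hybrid XOR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Delay (ns) | 1.8 | 0.84 | 1.16 | 0.85 | 1.22 | 1.6 | 1.12 |
| Delay (ns) | 1.6 | 0.98 | 1.32 | 1.39 | 1.67 | 1.92 | 1.79 |
| Power (mW) | 1.8 | 0.390 | 0.174 | 0.72 | 0.467 | 0.299 | 0.96 |
| Power (mW) | 1.6 | 0.146 | 0.138 | 0.407 | 0.205 | 0.210 | 0.53 |
| Power-Delay Product (fJ) | 1.8 | 327.6 | 201.84 | 612 | 569.74 | 478.4 | 1075.2 |
| Power-Delay Product (fJ) | 1.6 | 143.08 | 182.16 | 565.73 | 342.35 | 403.2 | 948.7 |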
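Table 3.6 Results of 8x8 Decomposed Multipliers for Supply Voltage Variations in 130 nm Technology
(Simulation results for an 8x8 Wallace multiplier; the proposed decomposition algorithm uses 4x4 Wallace multipliers)

| Performance Parameter | Supply Voltage (V) | Proposed Decomposition: CMOS | Proposed Decomposition: CPL | Proposed Decomposition: Hybrid XOR | Without Decomposition: CMOS | Without Decomposition: CPL | Without Decomposition: Hybrid XOR |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Delay (ns) | 1.3 | 0.67 | 1.74 | 1.09 | 0.93 | 2.78 | 1.52 |
| Delay (ns) | 1.1 | 0.85 | 1.98 | 1.25 | 1.25 | 3.26 | 2.03 |
| Power (µW) | 1.3 | 57.34 | 73.18 | 130.3 | 80.32 | 117.6 | 279.8 |
| Power (µW) | 1.1 | 39.10 | 46.77 | 83.92 | 55.09 | 65.73 | 172.2 |
| Power-Delay Product (fJ) | 1.3 | 38.41 | 127.33 | 142.02 | 74.69 | 326.92 | 425.29 |
| Power-Delay Product (fJ) | 1.1 | 33.23 | 92.604 | 104.9 | 68.86 | 214.27 | 349.56 |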
Table 3.7 Results of 16x16 Decomposed Multipliers for Supply Voltage
Variations in 180 nm Technology
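Simulation Results for a 16x16 Wallace Multiplier (8x8 = Proposed Decomposition Algorithm Using 8x8 Wallace Multipliers; 4x4 = Proposed Decomposition Algorithm Using 4x4 Wallace Multipliers; Without = Without Decomposition)

| Performance Parameter | Supply Voltage (volts) | 8x8, CMOS | 8x8, CPL | 8x8, Hybrid XOR | 4x4, CMOS | 4x4, CPL | 4x4, Hybrid XOR | Without, CMOS | Without, CPL | Without, Hybrid XOR |
|---|---|---|---|---|---|---|---|---|---|---|
| Delay (ns) | 1.8 | 1.17 | 2.27 | 1.94 | 1.42 | 2.982 | 2.44 | 1.69 | 3.70 | 3.06 |
| Delay (ns) | 1.6 | 1.84 | 2.52 | 2.17 | 2.35 | 3.34 | 3.03 | 2.52 | 4.12 | 3.79 |
| Power (mW) | 1.8 | 1.25 | 1.645 | 6.149 | 1.52 | 1.634 | 6.79 | 1.72 | 1.597 | 7.01 |
| Power (mW) | 1.6 | 0.739 | 1.132 | 2.020 | 0.869 | 1.22 | 2.451 | 0.942 | 1.146 | 2.772 |
| Power-Delay Product (pico Joules) | 1.8 | 1.465 | 3.7342 | 11.929 | 2.154 | 4.872 | 16.567 | 2.906 | 5.908 | 21.45 |
| Power-Delay Product (pico Joules) | 1.6 | 1.3596 | 2.8526 | 4.383 | 2.042 | 4.074 | 7.4265 | 2.37 | 4.7215 | 10.50 |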
Table 3.8 Results of 16x16 Decomposed Multipliers for Supply Voltage
Variations in 130 nm Technology
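Simulation Results for a 16x16 Wallace Multiplier (8x8 = Proposed Decomposition Algorithm Using 8x8 Wallace Multipliers; 4x4 = Proposed Decomposition Algorithm Using 4x4 Wallace Multipliers; Without = Without Decomposition)

| Performance Parameter | Supply Voltage (volts) | 8x8, CMOS | 8x8, CPL | 8x8, Hybrid XOR | 4x4, CMOS | 4x4, CPL | 4x4, Hybrid XOR | Without, CMOS | Without, CPL | Without, Hybrid XOR |
|---|---|---|---|---|---|---|---|---|---|---|
| Delay (ns) | 1.3 | 1.08 | 1.51 | 2.21 | 1.24 | 1.78 | 2.77 | 1.52 | 2.45 | 3.12 |
| Delay (ns) | 1.1 | 1.52 | 1.93 | 2.42 | 1.73 | 2.47 | 2.98 | 2.06 | 3.12 | 3.98 |
| Power (mW) | 1.3 | 0.284 | 0.644 | 0.628 | 0.334 | 0.569 | 0.775 | 0.487 | 0.671 | 0.788 |
| Power (mW) | 1.1 | 0.195 | 0.391 | 0.411 | 0.228 | 0.326 | 0.507 | 0.293 | 0.462 | 0.464 |
| Power-Delay Product (femto Joules) | 1.3 | 306.72 | 972.4 | 1387.88 | 414.16 | 1012.8 | 2146.75 | 740.24 | 1643.9 | 2458.56 |
| Power-Delay Product (femto Joules) | 1.1 | 296.4 | 754.63 | 994.62 | 394.4 | 805.22 | 1510.86 | 603.5 | 1441.4 | 1846.72 |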
From Tables 3.5 to 3.8, it is observed that the 8x8 Wallace tree multipliers implemented using 4x4 decomposition and the 16x16 Wallace tree multipliers implemented using 8x8 decomposition have the least power-delay product in all cases. The same trend holds for the decomposed Wallace multipliers across the supply voltages and technologies considered.
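The power-delay product entries can be cross-checked directly from the power and delay rows. The sketch below is a minimal Python illustration; the helper name `pdp_fj` is hypothetical, and the inputs are the CMOS, CPL and Hybrid XOR entries of Table 3.5 at 1.8 V for the decomposed 8x8 Wallace multiplier. Since 1 mW x 1 ns = 1 pJ = 1000 fJ, the products are expressed in femtojoules.

```python
def pdp_fj(power_mw, delay_ns):
    """Power-delay product in femtojoules (1 mW x 1 ns = 1 pJ = 1000 fJ)."""
    return power_mw * delay_ns * 1000.0


# (power in mW, delay in ns) at 1.8 V for the decomposed 8x8 Wallace
# multiplier, copied from Table 3.5.
table_3_5_decomposed = {
    "CMOS": (0.390, 0.84),
    "CPL": (0.174, 1.16),
    "Hybrid XOR": (0.72, 0.85),
}

pdp = {family: pdp_fj(p, d) for family, (p, d) in table_3_5_decomposed.items()}
for family, value in pdp.items():
    print(f"{family}: {value:.2f} fJ")
```

The computed products agree with the tabulated entries 327.6, 201.84 and 612 for the three logic families.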
3.8.1.3 Results of Dadda multipliers
Dadda multipliers are simulated using TSMC 180 nm technology. The threshold voltages of NMOS and PMOS transistors are kept as 0.39 V and -0.41 V respectively. The supply voltage is set to 1.8 V for all modules with rise and fall times of the input set to 0.10 ns. To account for process variation, Dadda Multiplier circuits were further tested at different supply voltages ranging from 1.0 V to 1.8 V. The two pipelined structures were then compared for their power dissipation values and number of transistors used.
Table 3.9 Results for 8x8 Dadda Multiplier with and without Decomposition
Performance Comparison of 8x8 Dadda Multiplier Designed using CPL Adder

| Supply Voltage (V) | Power (µW), Using 4x4 Decomposition | Power (µW), Without Decomposition | Delay (ns), Using 4x4 Decomposition | Delay (ns), Without Decomposition | Critical Path Delay Improvement (%) | Power-Delay Product (x 10^-15 Joules), Using 4x4 Decomposition | Power-Delay Product (x 10^-15 Joules), Without Decomposition | Power-Delay Product Savings (%) |
|---|---|---|---|---|---|---|---|---|
| 1.8 | 567 | 569 | 1.12 | 1.51 | 26.19 | 635.04 | 859.19 | 26.19 |
| 1.5 | 184 | 189 | 1.45 | 1.92 | 26.48 | 266.80 | 362.88 | 26.48 |
| 1.2 | 112 | 117 | 2.51 | 3.23 | 25.61 | 281.12 | 377.91 | 25.61 |
| 1.0 | 76.7 | 80.8 | 4.00 | 5.19 | 26.83 | 306.80 | 419.352 | 26.83 |

| Transistor Count | Using 4x4 Decomposition | Without Decomposition |
|---|---|---|
| No. of Transistors | 1648 | 1476 |
Table 3.10 Results for a 16x16 Dadda Multiplier with and without Decomposition
Performance Comparison of 16x16 Dadda Multiplier Designed using CPL Adder (Without = Without Decomposition; 8x8 Dadda = Decomposition Process Using 8x8 Dadda; 8x8 Decomposed = Decomposition Process Using 8x8 Decomposed Structure)

| Supply Voltage (V) | Power (mW), Without | Power (mW), 8x8 Dadda | Power (mW), 8x8 Decomposed | Delay (ns), Without | Delay (ns), 8x8 Dadda | Delay (ns), 8x8 Decomposed | Power-Delay Product (x 10^-12 Joules), Without | Power-Delay Product (x 10^-12 Joules), 8x8 Dadda | Power-Delay Product (x 10^-12 Joules), 8x8 Decomposed | Power-Delay Product Savings with respect to Without Decomposition (%), 8x8 Dadda | Power-Delay Product Savings with respect to Without Decomposition (%), 8x8 Decomposed |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1.8 | 2.696 | 2.774 | 2.547 | 1.71 | 1.41 | 1.54 | 4.61016 | 3.91134 | 3.92238 | 15.15 | 14.91 |
| 1.5 | 0.890 | 0.933 | 0.862 | 2.85 | 2.00 | 2.51 | 2.3585 | 1.866 | 2.16362 | 20.88 | 8.26 |
| 1.2 | 0.533 | 0.569 | 0.516 | 5.46 | 3.18 | 4.14 | 2.9101 | 1.80942 | 2.13624 | 37.82 | 26.59 |
| 1.0 | 0.183 | 0.196 | 0.178 | 8.71 | 5.05 | 6.63 | 1.59393 | 0.9898 | 1.18014 | 35.69 | 25.96 |

| Transistor Count | Without Decomposition | Decomposition Process Using 8x8 Dadda | Decomposition Process Using 8x8 Decomposed Structure |
|---|---|---|---|
| No. of Transistors | 6762 | 6792 | 7480 |
The simulation results of the 8x8 Dadda multiplier and the 16x16 Dadda
multiplier with and without decomposition for power supply variations are
summarized in Tables 3.9 and 3.10 respectively. In Table 3.10, the results for
two types of decomposition are listed: decomposition based on 8x8 Dadda
blocks, and decomposition based on 8x8 Dadda multipliers that are themselves
implemented using 4x4 decomposition, termed the 8x8 decomposed structure.
It is observed from Table 3.9 that, for the 8x8 multiplier structure,
the decomposition logic shows an improvement of 22% to 25% in delay
compared to Dadda’s method due to parallel processing of data. The power
dissipation is slightly less than that of the Dadda structure due to reduction in
glitches in spite of the extra logic circuitry. The power-delay product is
reduced by about 25% to 27%. From Table 3.10, it is also inferred that the
delay of the 16x16 Dadda multipliers implemented using decomposition
process is less compared to 16x16 Dadda multiplier implemented without
decomposition. This is because decomposition process incorporates the effect
of parallel processing.
The 16x16 Dadda multiplier implemented by decomposition using
8x8 Dadda partitioned blocks is faster than the other decomposed
structure. This is because the 8x8 decomposed structure requires more
stages of computation after parallel processing to achieve the final
result. It is observed that about 17% to 42% improvement in speed can be
achieved for the 16x16 Dadda multiplier implemented using decomposition (8x8
Dadda) compared to the 16x16 Dadda multiplier implemented without
decomposition. Further, a reduction of about 15% to 35% can be achieved in
power-delay product for the 16x16 Dadda multiplier implemented using
decomposition (8x8 Dadda) compared to the 16x16 Dadda multiplier
implemented without decomposition. Similarly, about 8% to 26%
improvement in power-delay product can be obtained for the 16x16 Dadda
multiplier implemented using decomposition (8x8 Decomposed Structure)
compared to the 16x16 Dadda multiplier implemented without decomposition.
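The quoted speed range can be reproduced from the delay columns of Table 3.10. The following Python sketch is a minimal illustration; the helper name `improvement_percent` is hypothetical, and the improvement is assumed to be defined as the reduction relative to the multiplier without decomposition.

```python
def improvement_percent(baseline, proposed):
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Delay values in ns for the 16x16 Dadda multiplier, copied from Table 3.10.
supply_v = [1.8, 1.5, 1.2, 1.0]
delay_without = [1.71, 2.85, 5.46, 8.71]    # without decomposition
delay_8x8_dadda = [1.41, 2.00, 3.18, 5.05]  # decomposition using 8x8 Dadda

speedup = [improvement_percent(w, d)
           for w, d in zip(delay_without, delay_8x8_dadda)]
for v, s in zip(supply_v, speedup):
    print(f"{v} V: {s:.1f} % lower delay")
```

Under this definition the improvement grows from roughly 17.5% at 1.8 V to about 42% at 1.0 V, which is consistent with the 17% to 42% range stated above.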
The simulation results for the two pipelined Dadda multiplier
structures are shown in Table 3.11. It can be observed that the latched CPL
adder reduces the overhead for pipelined structures compared to the use of
separate latches for pipelined multiplier design.
Table 3.11 Simulation results for 8x8 Pipelined Dadda Multiplier
structures
Power Results

| Supply Voltage (V) | Latched CPL Adder (µW) | PowerPC Latch (µW) | Savings % |
|---|---|---|---|
| 1.8 | 652 | 720 | 9.444 |
| 1.5 | 422 | 470 | 10.21 |
| 1.2 | 125 | 142 | 11.97 |
| 1.0 | 85.6 | 98.5 | 13.09 |

Transistor Count

| | Latched CPL Adder | PowerPC Latch | Savings % |
|---|---|---|---|
| No. of Transistors | 1840 | 1976 | 6.882 |
3.8.2 Simulation Results for Bypassing Algorithm
The average power consumed, worst case delay and power-delay
product for 4x4 and 16x16 Wallace multipliers with and without bypassing of
partial products are listed in Tables 3.12 and 3.13 respectively.
Table 3.12 Simulation Results for 4x4 Wallace Multiplier with and
without Bypassing
Logic Family Used for Adder Implementation: CMOS, CPL and Hybrid XOR

| Performance Parameters | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Average Power Consumed (mW) | 0.523 | 0.48 | 1.57 | 1.25 | 2.94 | 2.66 |
| Delay (ns) | 0.65 | 0.39 | 0.56 | 0.27 | 0.25 | 0.18 |
| Power-Delay Product (pico Joules) | 0.339 | 0.187 | 0.879 | 0.337 | 0.735 | 0.478 |
Table 3.13 Simulation Results for 16x16 Wallace Multiplier with and
without Bypassing
Logic Family Used for Adder Implementation: CMOS, CPL and Hybrid XOR

| Performance Parameters | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Average Power Consumed (mW) | 1.943 | 1.802 | 5.32 | 4.22 | 9.231 | 8.78 |
| Delay (ns) | 3.3 | 2.75 | 2.4 | 1.6 | 1.1 | 0.9 |
| Power-Delay Product (pico Joules) | 6.419 | 4.995 | 12.768 | 6.752 | 10.16 | 7.902 |
Figures 3.21 and 3.22 show the power-delay product comparison
of 4x4 and 16x16 Wallace tree multipliers with and without bypassing
respectively.
Figure 3.21 Power-Delay Product Comparison of a 4x4 Wallace
Multiplier with and without Bypassing
Figure 3.22 Power-Delay Product Comparison of a 16x16 Wallace
Multiplier with and without Bypassing
From Figures 3.21 and 3.22, it is observed that the largest improvement
in power-delay product occurs when the proposed bypassed Wallace
architectures are implemented using the CPL logic family. The Hybrid XOR
family possesses the least delay for the proposed bypassed Wallace
architectures. The average power consumed, worst case delay and power-delay
product for 4x4 and 16x16 Dadda multipliers with and without bypassing of
partial products are listed in Tables 3.14 and 3.15 respectively.
Table 3.14 Simulation Results for 4x4 Dadda Multiplier with and
without Bypassing
Logic Family Used for Adder Implementation: CMOS, CPL and Hybrid XOR

| Performance Parameters | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Average Power Consumed (mW) | 0.76 | 0.65 | 1.67 | 1.31 | 2.55 | 2.41 |
| Delay (ns) | 0.57 | 0.42 | 0.44 | 0.22 | 0.36 | 0.18 |
| Power-Delay Product (pico Joules) | 0.433 | 0.273 | 0.7348 | 0.288 | 0.918 | 0.433 |
Table 3.15 Simulation Results for 16x16 Dadda Multiplier with and
without Bypassing
Logic Family Used for Adder Implementation: CMOS, CPL and Hybrid XOR

| Performance Parameters | CMOS, Without Bypassing | CMOS, With Bypassing | CPL, Without Bypassing | CPL, With Bypassing | Hybrid XOR, Without Bypassing | Hybrid XOR, With Bypassing |
|---|---|---|---|---|---|---|
| Average Power Consumed (mW) | 3.8 | 3.5 | 5.09 | 4.22 | 9.11 | 8.78 |
| Delay (ns) | 1.76 | 1.49 | 1.52 | 1.43 | 1.1 | 0.9 |
| Power-Delay Product (pico Joules) | 6.688 | 5.22 | 7.736 | 6.034 | 10.021 | 7.902 |
Figures 3.23 and 3.24 show the power-delay product comparison
for 4x4 and 16x16 Dadda multipliers with and without bypassing
respectively.
Figure 3.23 Power-Delay Product Comparison of a 4x4 Dadda
Multiplier with and without Bypassing
Figure 3.24 Power-Delay Product Comparison of a 16x16 Dadda
Multiplier with and without Bypassing
Tables 3.14 and 3.15 reveal that the CPL family offers the largest
improvement in power-delay product, while the Hybrid XOR family offers the
least delay for the proposed Dadda multiplier architectures. The bypassing
technique is most effective when the input operand bits contain more zeros
than ones. This is because, when more operand bits are at logic ‘0’, more
(3,2) and (2,2) counters operate in bypass mode. This reduces both delay and
power because the bypassed computation units are deactivated, which yields a
large reduction in power-delay product.
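To illustrate why operands with many zero bits favour bypassing, the following Python sketch counts how many partial-product bits (a_i AND b_j) are zero for a given operand pair. This is only a proxy under stated assumptions: the function name `zero_partial_products` is hypothetical, operands are unsigned n-bit integers, and the actual bypass control logic of the (3,2) and (2,2) counters is not modelled.

```python
def zero_partial_products(a, b, n):
    """Count the partial-product bits a_i AND b_j that equal 0.

    A zero partial-product bit is treated here as one opportunity for a
    counter input to be bypassed (an assumption, not the thesis circuit).
    """
    zeros = 0
    for i in range(n):
        for j in range(n):
            if (((a >> i) & 1) & ((b >> j) & 1)) == 0:
                zeros += 1
    return zeros


n = 8
dense = zero_partial_products(0xFF, 0xFF, n)   # all operand bits are 1
sparse = zero_partial_products(0x11, 0x03, n)  # two 1-bits in each operand
print(f"dense operands:  {dense} of {n * n} partial-product bits are zero")
print(f"sparse operands: {sparse} of {n * n} partial-product bits are zero")
```

With all-ones operands no partial-product bit is zero, whereas the sparse pair leaves 60 of the 64 bits at zero, so far more counters would see inputs that do not need to be processed.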
Table 3.16 lists the comparison of the Proposed 8x8 Carry-Save
multiplier using 4x4 decomposition algorithm with recent related work in the
literature.
Table 3.16 Performance Comparison of the Proposed 8x8 Carry-Save
Multiplier with Related Work in the Literature
CPL Adder used for Implementing all the Multipliers in 180 nm Technology with Supply Voltage 1.8 V

| Parameter | 8x8 Carry-Save Multiplier by Senthilpari et al (2007) | Rizwan Mudassir and Abid (2005), Regular Array Multiplier | Rizwan Mudassir and Abid (2005), Architecture-I | Rizwan Mudassir and Abid (2005), Architecture-II | Proposed 8x8 Carry-Save Multiplier using 4x4 Decomposition |
|---|---|---|---|---|---|
| Power (mW) | - | 2.63 | 1.842 | 1.416 | 1.069 |
| Delay (ns) | 1.4138 | 1.298 | 1.167 | 1.159 | 1 |
From Table 3.16 it is evident that the Proposed 8x8 Carry-Save
multiplier achieves about 24.48% reduction in power compared to the 8x8
Carry-Save multiplier proposed by Senthilpari et al (2005). The Proposed 8x8
Carry-Save multiplier implemented using 4x4 decomposition offers about
22.95%, 14.31% and 13.71% improvement in speed compared to the regular
array multiplier, Architecture-I and Architecture-II respectively. Moreover,
power savings of about 59.39%, 41.96% and 24.5% are achieved for the Proposed
8x8 Carry-Save multiplier compared to the regular array multiplier,
Architecture-I and Architecture-II respectively.
Tables 3.17 and 3.18 list out the comparison of the Proposed
decomposition based Wallace and Dadda multipliers with related work in the
literature.
Table 3.17 Comparison of the Proposed Wallace Multipliers with
Previous Work in the Literature
All the Multipliers in 180 nm Technology with Supply Voltage 1.8 V

| Parameter | 8x8 Wallace Multiplier Proposed by Andrea et al (2001) | Proposed 8x8 Wallace Multiplier using 4x4 Decomposition with CPL adder |
|---|---|---|
| Delay (ns) | 1.6 | 0.98 |

| Parameter | 16x16 Reconfigurable Multiplier Proposed by Wei Li et al (2007), Design A | 16x16 Reconfigurable Multiplier Proposed by Wei Li et al (2007), Design B | 16x16 Wallace Multiplier Proposed by Andrea et al (2001) | Proposed 16x16 Wallace Multiplier using 8x8 Decomposition with CPL adder |
|---|---|---|---|---|
| Delay (ns) | 3.5 | 2.8 | 2 | 1.64 |
From Table 3.17 it is inferred that about 38.75% improvement in
speed is achieved for the proposed 8x8 multiplier compared to 8x8 Wallace
multiplier proposed by Andrea et al (2001). Further about 18% and 41.4%
improvement in critical path delay is observed for proposed 16x16 multiplier
compared to 16x16 Wallace multipliers proposed by Andrea et al (2001) and
Design B of Wei Li et al (2007) respectively.
Table 3.18 Comparison of the Proposed Dadda Multiplier with Previous
work in the Literature
Multiplier Implementations in 180 nm Technology

| Performance Parameter | 8x8 Dadda Multiplier Proposed by Andrea et al (2001) | Proposed 8x8 Dadda Multiplier | % Improvement |
|---|---|---|---|
| Delay (ns) | 1.6 | 1.12 | 30 |

| Performance Parameter | 16x16 Dadda Multiplier Proposed by Andrea et al (2001) | Proposed 16x16 Dadda Multiplier using 8x8 Dadda | % Improvement |
|---|---|---|---|
| Delay (ns) | 1.9 | 1.41 | 25.78 |
3.9 CONCLUSION
A new technique of implementing digital multipliers using
decomposition logic is proposed. The 8x8 Carry-Save multiplier implemented
using the decomposition technique offers 14.3% improvement in speed for the
CMOS logic family, 7.22% improvement in speed for the Hybrid XOR logic
family and 36.34% improvement in speed for the CPL logic family. The
power-delay product of the 8x8 decomposed Carry-Save multiplier is 19.11% lower
for the CMOS logic family, 33.88% lower for the CPL logic family and 28.32%
lower for the Hybrid XOR logic family. This reduction in power-delay product
may be useful in areas such as embedded systems and signal processing.
The 16x16 multipliers implemented using 4x4 decomposed structures
introduce excess delay in CMOS logic when compared to the other families. This
is because there are more transistors in the critical path.
When compared to the Carry-Save, Wallace and Dadda multipliers,
the proposed multipliers are faster and more energy efficient in spite of the
extra logic circuitry. It is clearly observed that NxN multipliers decomposed
and implemented using (N/2)x(N/2) multiplier blocks perform well in terms of
both power and delay. Multipliers implemented using (N/4)x(N/4) decomposed
blocks introduce a significant amount of latency and power consumption. This
is due to the large number of stages needed to combine the intermediate
results into the final result. A pipelined implementation of the decomposition
multiplier structure has been presented, using a new concept of adders having
latched outputs, which reduces the overhead costs in pipelined
implementations. The proposed bypassing technique for Wallace and Dadda
multipliers yields a significant amount of power savings.
The proposed bypassing algorithm for a Wallace tree multiplier
provides about 7% to 8% power savings for the CMOS logic family, 18% to
20% power savings for the CPL logic family and about 4% to 9% power savings
for the Hybrid XOR logic family. The proposed bypassing algorithm for a Dadda
tree multiplier can achieve about 8% to 14.5% power savings for the CMOS logic
family, 17% to 21.5% power savings for the CPL logic family and about 3% to
5% power savings for the Hybrid XOR logic family. The results reveal that the
CPL logic family yields the largest reduction in power-delay product, nearly
50% for the bypassed structures, when compared to the CMOS and
Hybrid XOR families. It is also possible to achieve 20% to 50%
improvement in speed. This large improvement in power and delay is
achieved at the expense of a small increase in area, which is accounted for
by the buffers and the multiplexers inserted in the design for each
full adder and half adder. The bypassing algorithm is more effective when the
operand bits contain more zeros than ones.