
Low-power parallel multiplier with column bypassing

M.-C. Wen, S.-J. Wang and Y.-N. Lin

A low-power parallel multiplier design, in which some columns in the

multiplier array can be turned off whenever their outputs are known, is

proposed. This design maintains the original array structure without

introducing extra boundary cells, as was the case in previous designs.

Experimental results show that it saves 10% of power for random

input. Higher power reduction can be achieved if the operands contain

more 0’s than 1’s.

Introduction: Multiplication is an essential arithmetic operation for

common DSP applications, such as filtering and fast Fourier transform

(FFT). To achieve high execution speed, parallel array multipliers are

widely used. These multipliers tend to consume most of the power in

DSP computations, and thus power-efficient multipliers are very

important for the design of low-power DSP systems.

CMOS is currently the dominant technology in digital VLSI. Two

components contribute to the power dissipation in CMOS circuits. The

static dissipation is due to leakage current, while dynamic power

dissipation is due to switching transient current as well as charging and

discharging of load capacitances. Since the amount of leakage current is

usually small, the major source of power dissipation in CMOS circuits is

the dynamic power dissipation. Dynamic power dissipation appears only

when a CMOS gate switches from one stable state to another. Thus, the

power consumption can be reduced if one can reduce the switching

activity of a given logic circuit without changing its function.

Many low-power multiplier designs can be found in the literature. A

straightforward approach is to design a full adder (FA) that consumes

less power [1]. Power reduction can also be achieved through structural

modification. For example, rows of partial products can be ignored [2].

Parallel multiplier: Consider the multiplication of two unsigned n-bit

numbers, where A = a_{n-1} a_{n-2} ... a_0 is the multiplicand and B =

b_{n-1} b_{n-2} ... b_0 is the multiplier. The product P = p_{2n-1} p_{2n-2} ... p_0

can be written as follows:

$$P = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} (a_i \cdot b_j)\, 2^{i+j}$$
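The double sum can be checked numerically. The following is a minimal Python sketch, not part of the original Letter; the function names `bits` and `product_from_partial_products` are illustrative assumptions.

```python
def bits(x, n):
    """Return the n low-order bits of x, least significant bit first."""
    return [(x >> k) & 1 for k in range(n)]


def product_from_partial_products(a, b, n):
    """Sum the weighted partial products (a_i AND b_j) * 2^(i+j)."""
    a_bits, b_bits = bits(a, n), bits(b, n)
    return sum((a_bits[i] & b_bits[j]) << (i + j)
               for i in range(n)
               for j in range(n))


# 1010 x 1111 in binary, i.e. 10 x 15
print(product_from_partial_products(0b1010, 0b1111, 4))  # prints 150
```

Each term of the sum is one AND gate in the array; the adders described next only accumulate these terms according to their weights.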

An array implementation, known as the Braun multiplier [3], is

shown in Fig. 1. On the other hand, the Baugh-Wooley multiplier uses

the same array structure to handle 2’s complement multiplication, with

some of the partial products replaced by their complements. The

multiplier array consists of (n − 1) rows of CSA, in which each row

contains (n − 1) FA cells. Each FA in the CSA array has two outputs:

the sum bit goes down while the carry bit goes to the lower-left FA. For

an FA in the first row, there are only two valid inputs, and the third input

bit is set to 0. Therefore, it can be replaced by a two-input half-adder.

The last row is a ripple adder for carry propagation. In this Letter, we

propose a low-power design for this multiplier.

Fig. 1 4 × 4 Braun multiplier
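The array structure can be modelled at bit level. The sketch below is a behavioural Python model written for illustration, not the authors' circuit. It assumes the indexing used later in the Letter: cell FA[i][j] sits in row i and diagonal (column) j, row i adds the partial products of b_{i+1}, and the cell's sum input comes from FA[i-1][j+1] or, at the array boundary, from a partial product. The names `full_adder` and `braun_multiply` are hypothetical.

```python
def full_adder(x, y, z):
    """One-bit full adder: returns (sum, carry)."""
    total = x + y + z
    return total & 1, total >> 1


def braun_multiply(a, b, n):
    """Multiply two unsigned n-bit integers with a bit-level Braun array model."""
    A = [(a >> k) & 1 for k in range(n)]
    B = [(b >> k) & 1 for k in range(n)]
    S = [[0] * (n - 1) for _ in range(n - 1)]  # sum outputs of FA[i][j]
    C = [[0] * (n - 1) for _ in range(n - 1)]  # carry outputs of FA[i][j]
    for i in range(n - 1):
        for j in range(n - 1):
            pp = A[j] & B[i + 1]               # partial product added in this cell
            if i == 0:
                sum_in = A[j + 1] & B[0]       # first row: second partial product
            elif j == n - 2:
                sum_in = A[n - 1] & B[i]       # leftmost cell: boundary partial product
            else:
                sum_in = S[i - 1][j + 1]       # sum bit from the previous row
            carry_in = C[i - 1][j] if i > 0 else 0
            S[i][j], C[i][j] = full_adder(pp, sum_in, carry_in)
    # Low-order product bits leave the array on the right-hand side.
    p = [A[0] & B[0]] + [S[i][0] for i in range(n - 1)]
    # The last row is a ripple adder for carry propagation.
    ripple = 0
    for k in range(n - 1):
        x = S[n - 2][k + 1] if k + 1 <= n - 2 else A[n - 1] & B[n - 1]
        s, ripple = full_adder(x, C[n - 2][k], ripple)
        p.append(s)
    p.append(ripple)
    return sum(bit << k for k, bit in enumerate(p))


print(braun_multiply(0b1010, 0b1111, 4))
```

In the first row the carry input is always 0, which is why those cells can be half-adders in hardware; the model keeps a full adder there only for uniformity.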

Low-power multipliers with row-bypassing: A low-power multiplier

design may disable the operations in some rows to save power [2]. If bit

b_j is 0, all partial products a_i b_j, 0 ≤ i ≤ n − 1, are zero. Therefore, the

additions in the corresponding row in Fig. 1 can be bypassed. The row-

bypassing multiplier is shown in Fig. 2. Each cell in the CSA array is

augmented with three tri-state gates and two multiplexers. For example,

let b_2 be 0 in Fig. 2. In this case, the CSA in the second row

(enclosed in the circle) can be bypassed, and the outputs from the first

row are fed directly to the third row CSA. However, since the rightmost

FA in the second row is disabled, it does not execute the addition and

thus the output is not correct. To remedy this problem, an extra circuit

must be added; its elements are located in the triangular area in Fig. 2.

Fig. 2 4 × 4 Braun multiplier with row-bypassing

Proposed method: Instead of bypassing rows of full adders, we

propose a multiplier design in which columns of adders are bypassed.

In this approach, the operations in a column can be disabled if the

corresponding bit in the multiplicand is 0. There are two advantages

to this approach. First, it eliminates the extra correcting circuit as

shown in Fig. 2. Secondly, the modified FA is simpler than that used

in the row-bypassing multiplier.

Assume that we execute 1010 × 1111 in Fig. 1. It can be verified that,

for FAs in the first and third diagonals, two out of the three input bits are

0: the ‘carry’ bit from its upper right FA, and the partial product a_i b_j

(note that a_0 = a_2 = 0). As a result, the output carry bit of such an FA is

0, and the output sum bit is simply equal to the third bit, which is the

‘sum’ output of its upper FA. The following theorem shows that this is

true in general. Therefore, when a_i is 0, the operations in the corresponding

diagonal can be disabled since all the outputs are known. We refer to

the FAs in a diagonal in Fig. 1 as a column. Let FA_{i, j} be the full adder

located in row i and column j, 0 ≤ i, j ≤ n − 2, in the (n − 1) × (n − 1)

array, as shown in Fig. 1. FA_{0, 0} is the adder at the upper-right corner. The

following theorem establishes the reason for column bypassing.

Theorem 1: When a_j = 0, the output of a column j adder cell FA_{i, j} can

be specified as follows. 1. The output carry bit is 0. 2. The output sum

bit is equal to the output sum bit of FA_{i-1, j+1}.

Proof: We prove this theorem by induction.

1. Consider row 0. Note that, in row 0, there are only two bits to be

added. Adder FA_{0, j} carries out a_j b_1 + a_{j+1} b_0. If a_j = 0, then the output

carry bit must be zero, and the output sum bit is equal to a_{j+1} b_0.

2. Assume that the theorem holds for row i.

3. In row i + 1, the inputs of FA_{i+1, j} are the carry bit from FA_{i, j}, the sum bit from

FA_{i, j+1}, and the partial product a_j b_{i+2}. Since a_j = 0, the partial product is 0 and, by the induction hypothesis, so is the carry bit from FA_{i, j}. Thus two out of the three

inputs are 0, the output carry bit is 0, and the output sum bit is equal to the sum bit sent by FA_{i, j+1}.

According to Theorem 1, when a_j = 0, the operations in column j can

be ignored and thus the full adders can be disabled since the outputs are

known.
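Theorem 1 can also be checked exhaustively for small operand sizes. The sketch below is an illustrative Python check, not part of the original Letter. It assumes that row i adds the partial products of b_{i+1} (consistent with row 0 adding a_j b_1) and that, at the array boundary, the "sum bit of FA[i-1][j+1]" is the partial product fed in at that position. The function name `theorem1_holds` is hypothetical.

```python
from itertools import product


def theorem1_holds(n):
    """Check Theorem 1 for every pair of unsigned n-bit operands."""
    for a, b in product(range(1 << n), repeat=2):
        A = [(a >> k) & 1 for k in range(n)]
        B = [(b >> k) & 1 for k in range(n)]
        S = [[0] * (n - 1) for _ in range(n - 1)]  # sum outputs
        C = [[0] * (n - 1) for _ in range(n - 1)]  # carry outputs
        for i in range(n - 1):
            for j in range(n - 1):
                # Sum input: output of FA[i-1][j+1], or a boundary partial product.
                if i == 0:
                    sum_in = A[j + 1] & B[0]
                elif j == n - 2:
                    sum_in = A[n - 1] & B[i]
                else:
                    sum_in = S[i - 1][j + 1]
                carry_in = C[i - 1][j] if i > 0 else 0
                total = (A[j] & B[i + 1]) + sum_in + carry_in
                S[i][j], C[i][j] = total & 1, total >> 1
                # Theorem 1: if a_j = 0, carry is 0 and sum equals the sum input.
                if A[j] == 0 and (C[i][j] != 0 or S[i][j] != sum_in):
                    return False
    return True


print(theorem1_holds(4))  # prints True
```

The check mirrors the induction: within a column whose a_j is 0, every carry is 0, so each cell only forwards its sum input.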

Fig. 3 4 × 4 column-bypassing multiplier

ELECTRONICS LETTERS 12th May 2005 Vol. 41 No. 10

Multiplier design: The column bypassing multiplier is shown in

Fig. 3. Note that we only need two tri-state gates and one multiplexer

in a modified adder cell. If a_j = 0, the FA will be disabled. We do not

need a tri-state gate for the carry input (C_{i-1, j}), and the reason is given

as follows. For a Braun multiplier, there are only two inputs for each

FA in the first row (i.e. row 0). Therefore, when a_j = 0, the two inputs

of FA_{0, j} are disabled, and thus its output carry bit will not be changed.

Therefore, all three inputs of FA_{1, j} are fixed, which prevents its output from

changing. At the bottom of the CSA array, we need to set the carry

outputs to be 0. Otherwise, the corresponding FAs may not produce

the correct outputs since their inputs are disabled. This is done by

adding an AND gate at the outputs of the last-row CSA adders.
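The behaviour described above can be sketched as a small Python model. This is an illustrative assumption about how the cell behaves, not the authors' netlist: a disabled FA keeps its previous (stale) sum and carry outputs, the multiplexer forwards the sum input when a_j = 0, and the last-row carry is ANDed with a_j. The class name `ColumnBypassMultiplier` and the attribute `active_cells` are hypothetical, and the count of active cells is only a rough proxy for switching activity, not a power estimate.

```python
class ColumnBypassMultiplier:
    """Behavioural model of an n x n column-bypassing array multiplier."""

    def __init__(self, n):
        self.n = n
        self.S = [[0] * (n - 1) for _ in range(n - 1)]  # held FA sum outputs
        self.C = [[0] * (n - 1) for _ in range(n - 1)]  # held FA carry outputs
        self.active_cells = 0

    def multiply(self, a, b):
        n = self.n
        A = [(a >> k) & 1 for k in range(n)]
        B = [(b >> k) & 1 for k in range(n)]
        out = [[0] * (n - 1) for _ in range(n - 1)]     # multiplexer outputs
        self.active_cells = 0
        for i in range(n - 1):
            for j in range(n - 1):
                if i == 0:
                    sum_in = A[j + 1] & B[0]
                elif j == n - 2:
                    sum_in = A[n - 1] & B[i]
                else:
                    sum_in = out[i - 1][j + 1]
                if A[j]:
                    # Enabled cell: the FA evaluates and its outputs are updated.
                    carry_in = self.C[i - 1][j] if i > 0 else 0
                    total = B[i + 1] + sum_in + carry_in  # a_j = 1, so pp = b_{i+1}
                    self.S[i][j], self.C[i][j] = total & 1, total >> 1
                    self.active_cells += 1
                # Multiplexer: FA sum when a_j = 1, bypassed sum input otherwise.
                out[i][j] = self.S[i][j] if A[j] else sum_in
        p = [A[0] & B[0]] + [out[i][0] for i in range(n - 1)]
        ripple = 0
        for k in range(n - 1):
            x = out[n - 2][k + 1] if k + 1 <= n - 2 else A[n - 1] & B[n - 1]
            c = self.C[n - 2][k] & A[k]   # AND gate forces stale carries to 0
            total = x + c + ripple
            p.append(total & 1)
            ripple = total >> 1
        p.append(ripple)
        return sum(bit << k for k, bit in enumerate(p))


m = ColumnBypassMultiplier(4)
print(m.multiply(0b1010, 0b1111), m.active_cells)  # prints 150 3
```

With a = 1010 only column 1 is enabled, so 3 of the 9 array cells evaluate. Because the object keeps its state between calls, feeding it a sequence of operands exercises the stale carries that the AND gates are there to mask.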

Results: To evaluate the performance of this low-power multiplier, we

implement the design with TSMC 0.35 µm technology. We compare

the performance of this design with a normal Braun multiplier and row-

bypassing multiplier [2]; the results are given as follows. Table 1 gives

the power consumption by the three designs. In this experiment, the

input patterns are assumed to be random, i.e. the probabilities of 0 and 1

are both 0.5. The power is estimated by running HSPICE. Note that this

is a relatively pessimistic estimation. If the operands are sparse (i.e. they

contain more 0’s than 1’s), there will be greater power saving. Our

results show that the row-bypassing multipliers actually consume more

power, possibly due to the extra logic. Our design consumes less power

in all cases, and the reduction increases as the size becomes larger. If

the distribution of 0’s and 1’s is not uniform, we shall be able to achieve

higher power saving. The areas of the three designs are listed in Table 2.

In our design, the area overhead is roughly 20%, while the area

overheads of row-bypassing multipliers are more than 40%.

Table 1: Power (mW)

Multiplier type   4 × 4    (%)     8 × 8   (%)     16 × 16   (%)
Braun             0.4325   100     2.31    100     8.01      100
[2]               0.5537   128     2.76    119     8.26      103
Proposed          0.4298   99.4    2.25    97.4    7.15      89.3

Table 2: Area (µm²)

Multiplier type   4 × 4    (%)     8 × 8   (%)     16 × 16   (%)
Braun             8672     100     33286   100     131040    100
[2]               13692    158     48991   147     185367    141
Proposed          10063    116     40236   121     162131    124
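The percentage columns in both tables are the raw values normalised to the Braun multiplier. The short Python sketch below recomputes them from the tabulated numbers; the variable and function names are illustrative assumptions, and the values are copied from Tables 1 and 2 for sizes 4 × 4, 8 × 8 and 16 × 16.

```python
# Raw values from Table 1 (power) and Table 2 (area), ordered 4x4, 8x8, 16x16.
POWER = {
    "Braun":    [0.4325, 2.31, 8.01],
    "[2]":      [0.5537, 2.76, 8.26],
    "Proposed": [0.4298, 2.25, 7.15],
}
AREA = {
    "Braun":    [8672, 33286, 131040],
    "[2]":      [13692, 48991, 185367],
    "Proposed": [10063, 40236, 162131],
}


def relative_percent(table, baseline="Braun"):
    """Express every entry as a percentage of the baseline design."""
    return {
        design: [100.0 * value / ref
                 for value, ref in zip(values, table[baseline])]
        for design, values in table.items()
    }


power_rel = relative_percent(POWER)
area_rel = relative_percent(AREA)

# Power saving of the proposed design at 16x16, in percent of the Braun power.
saving_16 = 100.0 - power_rel["Proposed"][2]
print(f"16x16 power saving: {saving_16:.1f}%")
print("area overhead (%):", [round(v - 100.0) for v in area_rel["Proposed"]])
```

The 16 × 16 power saving computed this way is the figure of roughly 10% quoted in the abstract, and the area overheads of the proposed design come out at 16 to 24%.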

Conclusion: We have presented a new low-power parallel multiplier

design, which disables the operations in columns of full adders.

Compared with row-bypassing, this technique achieves higher

power reduction with lower hardware overhead.

© IEE 2005 2 February 2005

Electronics Letters online no: 20050464

doi: 10.1049/el:20050464

M.-C. Wen, S.-J. Wang and Y.-N. Lin (Department of Computer

Science, National Chung-Hsing University, 250 Kuo-Kuan Road,

Taichung 40227, Taiwan)

E-mail: [email protected]

References

1 Wu, A.: ‘High performance adder cell for low power pipelined multiplier’. Proc. IEEE Int. Symp. on Circuits and Systems, May 1996, Vol. 4, pp. 57–60

2 Ohban, J., Moshnyaga, V.G., and Inoue, K.: ‘Multiplier energy reduction through bypassing of partial products’. Proc. Asia-Pacific Conf. on Circuits and Systems, 2002, Vol. 2, pp. 13–17

3 Abu-Khater, I.S., Bellaouar, A., and Elmasry, M.: ‘Circuit techniques for CMOS low-power high-performance multipliers’, IEEE J. Solid-State Circuits, 1996, 31, (10), pp. 1535–1546
