


458 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 51, NO. 2, FEBRUARY 2003

Two’s Complement Computation Sharing Multiplier and Its Applications to High Performance DFE

Hunsoo Choo, Khurram Muhammad, and Kaushik Roy, Fellow, IEEE

Abstract—We present a novel computation sharing multiplier architecture for two’s complement numbers that leads to high-performance digital signal processing systems with low power consumption. The computation sharing multiplier targets the reduction of power consumption by removing redundant computations within the system through computation reuse. Use of the computation sharing multiplier leads to a high-performance finite impulse response (FIR) filtering operation by reusing optimal precomputations. The proposed computation sharing multiplier is applicable to both adaptive and nonadaptive FIR filter implementations. In this paper, a decision feedback equalizer (DFE) was implemented based on the computation sharing multiplier in a 0.25-μm technology as an example of an adaptive filter. The performance and power consumption of the DFE using the computation sharing multiplier are compared with those of DFEs using a Wallace-tree and a Booth-encoded multiplier. The DFE implemented with the computation sharing multiplier shows an improvement in performance over the DFE using a Wallace-tree multiplier while reducing the power consumption significantly.

Index Terms—Baugh–Wooley algorithm, carry–save optimization, computation sharing multiplier, decision feedback equalizer.

I. INTRODUCTION

WITH the rapid growth of consumer electronics, the demand for signal processing systems with very high data throughput has increased more than ever. To satisfy the required throughput and power dissipation, application-specific integrated circuits are preferred over the use of digital signal processing (DSP) cores. Hence, efficient design methods for signal processing circuits are required.

Finite impulse response (FIR) filtering is the most widely used operation in DSP systems. Much work has been done to reduce the complexity of FIR filters. Most of the work [1]–[6] focused on efficient filter structures with fixed coefficients. Canonical signed digit (CSD) representation [2], [5] is used to reduce the number of additions and subtractions required for the filtering operation by reducing the total number of nonzero bits in the coefficients. The multiple constant multiplication (MCM) problem [6] was also explored to optimize the multiplication algorithm in the filtering operation. Elimination of common subexpressions [4] has also been investigated to design filter structures with less computation at the cost of a small increase in complexity.

Manuscript received March 8, 2001; revised September 12, 2002. This work was supported in part by DARPA. The preliminary paper appeared in the Proceedings of the International Conference on Acoustics, Speech and Signal Processing, 2001. The associate editor coordinating the review of this paper and approving it for publication was Dr. Olivier Cappe.

H. Choo and K. Roy are with the School of Electrical and Computer Engineering, Purdue University, West Lafayette, IN 47907-1285 USA (e-mail: [email protected]; [email protected]).

K. Muhammad is with Texas Instruments, Dallas, TX 75243 USA (e-mail: [email protected]).

Digital Object Identifier 10.1109/TSP.2002.806984

However, the research mentioned above mainly focused on the complexity reduction of multiplications by multiple constants, and it is difficult to apply these techniques to general filtering operations with varying coefficients, such as adaptive filters. In our work, we focus on reducing the computational complexity of the filtering operation by reducing the complexity of general multiplication without much cost in design metrics. We present a computation sharing multiplier (CSHM) architecture, which identifies common computations and shares them between different multiplications [7]. We modified it so that the CSHM can work with sign magnitude and two’s complement numbers. Because the computation sharing technique deals with general multiplications, it can be applied to both adaptive and nonadaptive filters.

The carry–save optimization technique has proven to be a powerful scheme for multioperand addition. In our research, the carry–save optimization technique is generalized and referred to as the carry–sum dual number representation (DNR), which uses the carry–save adder structure and generates the output as two intermediate signals without calculating the final computation result.

To verify the improvement in speed from the computation sharing technique, we implement a decision feedback equalizer as an example of an adaptive filter. Several equalizing algorithms exist in the literature (e.g., [8]–[10]). Among them, the maximum likelihood estimation Viterbi algorithm achieves the best performance. However, the architecture for such an algorithm is complex and difficult to implement, especially for channels with a long impulse response [10]. On the other hand, the decision feedback equalizer (DFE) has a much simpler architecture, and it shows performance comparable to the Viterbi equalizer. This has led to the investigation of the DFE architecture [11], [12]. The DFE is composed of two linear filters (a feedback and a feedforward filter) and a nonlinear decision unit. The order of a filter depends on the symbol rate and channel characteristics. As the symbol rate increases, a faster processing speed of the DFE with low power consumption is desired. Hence, there is a need to investigate more efficient architectures and algorithms for the DFE.

In the implementation of the DFE, all the multipliers are replaced with computation sharing multipliers, which reduce the computational complexity, and, wherever the function is not affected, most of the outputs of the function modules are represented as carry–sum dual numbers, reducing the total delay. As a result, the implemented DFE is composed of FIR filters with lower complexity and higher performance, leading to a faster operating speed at a lower power dissipation.

1053-587X/03$17.00 © 2003 IEEE



The rest of this paper is organized as follows. In Section II, the vector scaling operation and the computation sharing algorithm are described. The architecture of the CSHM using sign-magnitude numbers is presented in Section III. Section IV explains the algorithm and architecture of the modified CSHM for two’s complement numbers. The implementation of the DFE is presented in Section V. Sections VI and VII present the numerical results and conclusions, respectively.

II. BASIC ALGORITHM OF COMPUTATION SHARING MULTIPLIER

A. Vector Scaling Operation

Generally, a large number of DSP computations can be described as the multiplication of two matrices. For example, the relationship between the input and the output of a linear time-invariant (LTI) FIR filter of length N can be described as

y(n) = Σ_{i=0}^{N−1} c_i · x(n − i)    (1)

where c_i is the ith coefficient, and x(n − i) is the sample of the input data at time instance n − i. Equation (1) can be represented as follows:

Y = X · c    (2)

where Y is the column vector of outputs, X is composed of N column vectors whose ith column vector X_i holds the input samples x(n − i) for 0 ≤ i ≤ N − 1, and c is the column vector with the coefficients of the filter as its elements.

Equation (3) describes the transposed direct-form FIR filter:

Y = C · x    (3)

where C is a matrix whose ith row consists of i zeros, followed by the elements of the coefficient vector c, followed by trailing zeros, and x is a column vector of input samples.

In (2) and (3), the coefficient vector c and the matrix C built from it are the common factors used to compute the output at every time step. Hence, the matrix multiplication can be considered as a product of a varying vector and a constant vector. We refer to such an operation as the vector scaling operation [7].

Expressing these equations as one vector scaling operation allows us to identify common factors and share them between several computations, as is evident from the following discussion.
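To make the vector scaling view concrete, the following is a minimal sketch of the direct-form FIR computation of (1); the paper gives no code, and the coefficients, samples, and function name here are our own arbitrary choices:

```python
# A minimal model of the LTI FIR filtering operation of (1):
# y(n) = sum_i c_i * x(n - i), with samples outside the input treated as zero.

def fir(x, c):
    """Direct-form FIR: every output reuses the same coefficient vector c."""
    return [sum(c[i] * x[n - i] for i in range(len(c)) if 0 <= n - i < len(x))
            for n in range(len(x))]

c = [2, -1, 3]          # constant coefficient vector (the common factor)
x = [1, 4, 0, 2, 5]     # varying input samples

assert fir(x, c) == [2, 7, -1, 16, 8]
```

Every output sample multiplies the input by the same constant vector c, which is exactly the common factor that the vector scaling operation exposes for reuse.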

B. Computation Sharing Algorithm

1) Reduction of Computational Redundancy: The most fundamental computation of the vector scaling operation is the multiplication of one scalar s and one vector X. Suppose the scalar s can be decomposed into smaller bit sequences a_i, and s can be rebuilt from these numbers by a few shifts and adds. Then, s can be expressed as s = Σ_i a_i · 2^{s_i}. We refer to these smaller bit sequences as alphabets, and the set of alphabets as the alphabet set.

The multiplication can now be represented as

s · X = Σ_i (a_i · X) · 2^{s_i}.    (4)

If we have computed the values a_i · X in advance, the multiplication of (4) is significantly simplified to only a few shifts and adds. The additional computations for the precomputed values a_i · X are referred to as precomputation.

For instance, s = 110001 can be decomposed into 11 · 2^4 + 1. If both 11X and 1X are precomputed and available, the entire multiplication process can be replaced with a four-bit shift of 11X and the addition of two numbers. Here, “11” and “1” are used as the alphabets, and the alphabet set should include “11” and “1” as its elements.

2) Optimal Alphabet Set: The alphabet set determines the number of required shifts and adds. Hence, we have to choose the alphabet set so that computation sharing among multiplications is maximized. The maximal computation sharing occurs when we use an alphabet set that minimizes the overall number of distinct precomputations required to compute the vector scaling operation.

The optimal alphabet set should satisfy the following properties to obtain the maximal computation sharing.

• It covers all the elements of the vector.
• It minimizes the number of alphabets in the set.
• The total number of add operations in the decompositions of all the elements of the vector has to be minimized.

The first condition must hold for any alphabet set. The second condition is related to the overhead of the precomputation. As the number of alphabets increases, so does the amount of precomputation, as well as the number of buses and latches. Hence, the size of the alphabet set has to be minimized. After the multiplication is transformed into shifts and adds, the complexity of computation is mainly determined by the additions, because shifts can be done simply by routing. This is related to the third condition.

Two approaches to determining the alphabet set are suggested in [7]: the greedy approach and the fixed size look-up (FSL) rule. In the FSL rule, the bit length of the alphabets is fixed, and all the odd numbers within that bit length are used as the alphabet set.

We decide the alphabet set using the FSL rule, which allows a simpler multiplier architecture than the greedy approach. For the 16 × 16 multiplier, the alphabet length is chosen as 4 bits, and the set {1, 3, 5, 7, 9, 11, 13, 15} is used as the alphabet set.
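Why the odd numbers suffice: any nonzero 4-bit subnumber is one of these odd alphabets shifted left by its trailing zeros. A quick check (the helper name is ours):

```python
# Under the FSL rule, every nonzero 4-bit subnumber n satisfies
# n == alphabet << shift for exactly one odd alphabet in {1, 3, ..., 15}.

def nibble_to_alphabet(n):
    """Split a nonzero nibble into (odd alphabet, left-shift amount)."""
    shift = 0
    while n % 2 == 0:          # strip trailing zeros
        n >>= 1
        shift += 1
    return n, shift            # n is now odd, i.e., an alphabet

assert nibble_to_alphabet(0b1100) == (3, 2)
assert all(nibble_to_alphabet(n)[0] in {1, 3, 5, 7, 9, 11, 13, 15}
           for n in range(1, 16))
```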

C. Carry–Sum Dual Number Representation

In DSP and communications applications, most of the operations are arithmetic computations like addition, subtraction, and multiplication. Hence, we can consider DSP and communications processing as a sequence of arithmetic operations.



Fig. 1. Sequence of arithmetic functions. (a) Single number representation. (b) Carry–sum dual number representation.

Fig. 2. Carry–sum redundant number scheme.

In a hardware implementation, a straightforward way to implement the arithmetic functions is to compute the final result by propagating the carry. Hence, in a sequence of such operations, a carry is propagated at the end of every operation. Fig. 1(a) shows the idea behind this implementation method. We refer to it as single number representation. Alternatively, a relaxed rule can be applied such that two numbers, the sum of which is equal to the final result, are generated as the two outputs of an arithmetic operation. Fig. 1(b) shows the idea behind this method.

For example, consider a filtering operation where an accumulator follows a multiplier. A carry–save multiplier is used for the multiplication, and a carry–save adder is used as the accumulator.

The carry–save multiplier can be decomposed into three function blocks: partial product generation, the carry–save adder block, and the vector merger adder (VMA). Suppose the carry–save adder block adds all the partial products and generates two signals representing 7 and 9 (see Fig. 2). The VMA adds these two final outputs using carry propagation to generate the end result 16. If 3 is subsequently added, another carry propagation must complete to obtain the final result. In this approach, the delay of the VMA contributes a large part of the total delay, since a carry has to propagate from the LSB to the MSB in the VMA. Alternatively, the three values may be added without propagating a carry in the intermediate sum (of 7 and 9) to which the third number (3) is added. This removes a VMA delay from the overall operation. This principle is used in constructing array multipliers. We will refer to this scheme as the carry–sum dual number representation (DNR).
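The carry–sum idea can be modeled with a 3:2 compressor, i.e., full-adder logic applied bitwise; this sketch (our own) follows the 7 + 9 + 3 example above:

```python
# Carry-save addition: three operands in, a (sum, carry) dual number out.
# No carry ripples until the final vector-merge addition.

def carry_save_add(a, b, c):
    s = a ^ b ^ c                                 # bitwise sums
    carry = ((a & b) | (b & c) | (a & c)) << 1    # carries at their weight
    return s, carry

s, carry = carry_save_add(7, 9, 0)      # keep 7 + 9 as a dual number (sums to 16)
s, carry = carry_save_add(s, carry, 3)  # fold in 3 without propagating a carry
assert s + carry == 19                  # one carry-propagate add at the very end
```

Each `carry_save_add` is a constant-depth bitwise step; the single `s + carry` at the end stands in for the vector merger adder.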

Fig. 3. Computational sharing multiplier architecture with FSL rule.

Dual number representation allows great flexibility in system design. Depending on the situation, we can generate the final computation result as a single complete signal or as two intermediate signals. These two formats are equivalent as long as there is no loss of information at the bit level. As a consequence, the data path across several function modules can be optimized, leading to high performance. We introduce the concept of DNR into the implementation of the computation sharing multiplier and the DFE to achieve higher performance.

III. COMPUTATION SHARING MULTIPLIER ARCHITECTURE

A. Computation Sharing Multiplier Algorithm

The computation sharing multiplier (CSHM) architecture is shown in Fig. 3. The CSHM is composed of two subblocks: the precomputer and the select/shift and adder unit. The precomputer computes the multiplications between the preselected alphabets and the varying vector X. The select/shift and adder unit (SSA) chooses the proper values from the precomputed values, shifts them, and adds them to obtain the final computation result.

Fig. 3 shows the basic architecture of the 8 × 8 computation sharing multiplier. Alphabets are preselected, and those alphabets are represented as unsigned numbers with a 4-bit length.

The precomputer computes the multiplication of the alphabets and the input vector X in advance and stores the results for reuse. The SSA consists of a few select/shifts and final adders. When a scalar comes in, it is divided into smaller bit sequences that have the same length as an alphabet. We refer to these bit sequences as subnumbers. The SHIFT unit right-shifts a subnumber of the scalar to find the matching alphabet and generates control signals, such as the index signal to the mux and the shift signal to the ISHIFT. The 8 : 1 mux selects one precomputed value based



Fig. 4. Implementation of computational sharing multiplier.

on the index signal, and the ISHIFT left-shifts the value selected by the 8 : 1 mux. Each select/shift performs the same operation with a different subnumber of the scalar. Finally, the output values from the select/shifts are properly shifted and added at the final adders to generate the multiplication result.

Suppose we multiply a vector X by the scalar 11100100. An alphabet set {1, 3, 5, 7, 9, 11, 13, 15} is predetermined, and each alphabet is represented by a 4-bit unsigned number. The precomputer computes the multiplication of the input vector X by every alphabet and keeps the results in 12 bits. When the scalar comes into the multiplier, it is partitioned into two 4-bit sequences simply by routing. Each partition of the scalar goes to the SHIFT of a select/shift. Two select/shifts are used to deal with the MSB part and the LSB part in parallel. The SHIFT generates the appropriate index and shift signals to select a proper precomputed value. In this example, the SHIFT of the upper select/shift chooses 7X, and the lower select/shift chooses 1X. Then, 7X and 1X are left-shifted by 1 bit and 2 bits, respectively. As a result, we obtain 14X and 4X, each with a 12-bit length, from the two select/shifts. What is left is the addition of the two numbers after a proper shift operation at the final adders. The two select/shift units work on the upper and lower 4 bits of the scalar. Hence, we have to shift 14X to the left by 4 bits, but this overhead is negligible because it can be done simply by routing without paying any cost. Finally, the addition of the two numbers gives us the multiplication result.

B. Implementation of Computation Sharing Multiplier for Sign-Magnitude Numbers

We implemented a 16 × 16 computation sharing multiplier for sign-magnitude numbers (CSHM-SM) (Fig. 4). The architecture follows that of Fig. 3. The alphabet set {1, 3, 5, 7, 9, 11, 13, 15} is used, and each alphabet is represented by a 4-bit unsigned number.

Fig. 5 shows the abstract structure of the precomputer. Each precomputation is performed by a few shifts and adds.

Fig. 5. Precomputer structure.

Fig. 6. Implementation of precomputer. (a) 5X = 100X (X ≪ 2) + X. (b) 7X (111X) = 1000X (X ≪ 3) − X = (X ≪ 3) + X̄ + 1.

For example, 3X is obtained from the addition of X and one-bit-shifted X. Fig. 6 shows the detailed implementation of 5X and 7X. Four select/shifts are used for parallel processing. We use a

tristate buffer instead of a mux to reduce the power dissipation by eliminating the unnecessary capacitive load of the interconnect. In addition, the alphabet set does not include a zero alphabet. Hence, we simply add a 2 : 1 mux to deal with the zero-coefficient case. Fig. 7 shows the implementation of a select/shift unit.

At the very last stage of multiplication, 24- and 32-bit final adders are required to compute the final result. The delay of a select/shift is small because it performs only a shift operation. Therefore, the bottleneck of the SSA is the final adder. We use a square-root carry–select adder [13] to boost the speed of the addition.
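The precomputations of Figs. 5 and 6 amount to at most one add or subtract per alphabet product; the particular shift-and-add choices below are our own sketch and may differ from the exact gating in the figures:

```python
# Each alphabet product is built from shifted copies of X, as in Figs. 5 and 6.

def precompute(x):
    return {
        1: x,
        3: (x << 1) + x,
        5: (x << 2) + x,               # Fig. 6(a): 100X + X
        7: (x << 3) - x,               # Fig. 6(b): 1000X - X (invert and add 1)
        9: (x << 3) + x,
        11: (x << 3) + (x << 1) + x,
        13: (x << 3) + (x << 2) + x,
        15: (x << 4) - x,
    }

assert all(v == a * 57 for a, v in precompute(57).items())
```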

IV. MODIFIED CSHM FOR TWO’S COMPLEMENT

In Section III-B, we considered sign magnitude numbers. However, the use of two’s complement numbers can be potentially more efficient than sign magnitude numbers for certain applications. We introduce a modified CSHM for two’s complement numbers using the Baugh–Wooley algorithm and the carry–sum dual number representation. Depending on the



Fig. 7. Implementation of select/shift unit.

level of carry–sum DNR usage, two multiplier structures can be considered: one using carry–sum DNR fully inside the multiplier, and the other not using carry–sum DNR inside the multiplier but generating its output in dual numbers. They are referred to as the computation sharing multiplier using dual number representation (CSHM-DNR) and the computation sharing multiplier using two’s complement (CSHM-TC), respectively. Even though only the architecture of the CSHM-DNR is presented in this paper, the architecture of the CSHM-TC can be recognized easily.

A. Multiplication Algorithm for Two’s Complement

The advantage of the CSHM comes from the use of simple shift operations. For sign magnitude numbers, a shift operation only changes the number by a power of two. Suppose we compute the multiplications aX and bX, where a and b differ only by a power of two. Then, we can easily obtain the partial products of bX by shifting the partial products of aX. If we know all the partial products of aX in advance, the only cost to compute the partial products of bX is a few shift operations. This is the basic idea of the CSHM architecture. However, in the case of two’s complement numbers, a shift operation makes the number totally different. As a consequence, we have to use a special multiplication algorithm for two’s complement numbers in which the shift property can be exploited. A useful algorithm for two’s complement multiplication, referred to as the Baugh–Wooley algorithm, is introduced in [14] and [15].

In general, consider an N-bit binary number represented in two’s complement as A = a_{N−1} a_{N−2} ⋯ a_1 a_0. Mathematically, the value of A can be expressed as

A = −a_{N−1} 2^{N−1} + Σ_{i=0}^{N−2} a_i 2^i.    (5)

Fig. 8. Multiplication algorithm for two’s complement numbers: Baugh–Wooley algorithm.

Then, the product of two two’s complement numbers A and B is expressed, modulo 2^{2N}, as

A · B = a_{N−1} b_{N−1} 2^{2N−2} + Σ_{i=0}^{N−2} Σ_{j=0}^{N−2} a_i b_j 2^{i+j} + 2^{N−1} Σ_{j=0}^{N−2} (a_{N−1} b_j)′ 2^j + 2^{N−1} Σ_{i=0}^{N−2} (a_i b_{N−1})′ 2^i + 2^N + 2^{2N−1}    (6)

where (·)′ denotes the inversion of a bit product. Equation (6) shows that the multiplication of two’s complement numbers can be written in a form that involves only bit products and inversions of bit products. The final product can be obtained by adding these bit products. Fig. 8 shows how this algorithm works in the case of a 4 × 4 multiplication.

The first three rows are the partial products generated by one NAND and three AND operations. The fourth row is the partial product generated by one AND and three NAND operations with the sign bit. We call the first three partial products the partial products with magnitude part (PM) and the fourth one the partial product with sign bit (PS). The partial products of PM are generated by the same operations. Hence, the shift operation can be used to obtain the partial products at different bit levels. For example, suppose the first and third multiplier bits are equal in Fig. 8; then, the first and third rows of PM are composed of the same bit products, and the third row can be obtained by shifting the first row by 2 bits. Similarly, the shift property can be used to compute the partial products constituting PM.

B. Computation Sharing Multiplier with Dual Number Representation

Using the Baugh–Wooley algorithm, we can devise a multiplication method for two’s complement numbers where the shift



Fig. 9. Architecture of computational sharing multiplier using dual number representation.

property can be exploited as in the sign magnitude multiplication. We suggest a modified architecture of the CSHM using the Baugh–Wooley algorithm and the carry–sum dual number representation (DNR). Fig. 9 shows the basic architecture of the CSHM-DNR. According to the Baugh–Wooley algorithm, PM and PS are necessary to obtain the final multiplication result. In the CSHM-DNR, the precomputer and the Select & Shift (S&S) take partial responsibility for computing PM and PS, respectively. The precomputer generates the precomputed values that are used to build the PM of the target multiplication. The S&S uses the precomputed values to generate PM and also generates PS. The additions of the two 1s at the Nth and (2N − 1)th bit levels are integrated into the final adder structure.

We use the same alphabets and the 4-bit unsigned number representation. As in the CSHM, the precomputer generates all the partial products of the input and the alphabets and adds them. One NAND and several AND bit operations are used to generate the partial products that correspond to PM. The summation of all partial products is represented in carry–sum numbers and stored in the precomputer for reuse. Hence, the number of precomputer outputs is double that of the CSHM. These outputs come into the four S&S units as inputs.

The S&S consists of four Select/Shift units and a PS generator. The scalar comes as an input to the CSHM-DNR. The scalar is divided into subnumbers having the same length as the alphabets, excluding the sign bit. Depending on these subnumbers of the scalar, each Select/Shift unit selects the proper C (carry) and S (sum) numbers and shifts them to generate the correct values corresponding to the partial products of PM. Unlike the CSHM, the select/shift units generate one more output: the correction value. The correction values are required due to the NAND operation, which is used when generating the partial product of

Fig. 10. Example of CSHM-DNR computation: −43 (1010101) × −26 (1100110).

PM. For example, suppose we have already computed the PM for a multiplier y (see Fig. 8) and want to compute the multiplication by its one-bit-shifted version 2y. According to the Baugh–Wooley algorithm, the partial products generated by the second and third bits of 2y can be obtained by shifting the first and second partial products of y. Additionally, “1 0 0 0” has to be added to compute the correct result, which is generated by the “0” at the LSB. Hence, a correction term is required to compensate for this additional value. The S&S also generates the PS of the Baugh–Wooley algorithm. For that, one AND and several NAND bit operations with the sign bit of the scalar are performed. Finally, all the outputs from the S&S are generated. These are inserted into the carry–save tree adder (CSTA), which is the last stage of the multiplier. The additions of the two 1s of the Baugh–Wooley algorithm are integrated into this CS tree adder.

Fig. 10 shows an example of the CSHM-DNR multiplication procedure: −43 × −26. Here, −43 is the input. In this example, each number is represented as a 7-bit two's complement number. We use 1, 3, 5, and 7 as alphabets, each represented in 3 bits. In the precomputer, the products of the input with the alphabets 1 and 3 are precomputed and stored as carry–sum dual numbers; Fig. 10 shows this precomputation. When the scalar −26 comes in, its 6 bits, excluding the sign bit, are divided into two subnumbers, (1 0 0) and (1 1 0). (1 0 0) is a two-bit-shifted version of one (0 0 1), and (1 1 0) is a one-bit-shifted version of three (0 1 1). Hence, each S&S selects and shifts the C and S of the corresponding precomputed product, depending on its subnumber. Depending on the number of shifts, correction values are generated by the S&Ss: for a one-bit shift, the correction value is (0 0 1), and (0 1 1) is the correction value generated for a two-bit shift. The S&S also generates the PS. Finally, all the generated numbers are added after proper shift operations. This addition is performed in the carry–save tree adder.
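The reuse in this example can be sketched in a few lines of Python. The sketch below is an illustration of the sharing idea with assumed helper names, not the hardware datapath; the Baugh–Wooley sign corrections are omitted, so it handles nonnegative scalars only.

```python
# Illustrative sketch of computation sharing (not the paper's circuit).
# The scalar is treated as a nonnegative integer of groups * width bits.

ALPHABETS = (1, 3, 5, 7)  # 3-bit odd alphabets, as in the Fig. 10 example

def decompose(subnum):
    """Express a subnumber as (alphabet, shift); None for a zero subnumber."""
    if subnum == 0:
        return None
    shift = 0
    while subnum % 2 == 0:        # strip trailing zeros -> shift count
        subnum //= 2
        shift += 1
    assert subnum in ALPHABETS
    return subnum, shift

def csm_multiply(x, scalar, width=3, groups=2):
    """Multiply x by `scalar` using only precomputed x*alphabet products."""
    precomputed = {a: a * x for a in ALPHABETS}    # shared, computed once
    total = 0
    for g in range(groups):
        sub = (scalar >> (g * width)) & ((1 << width) - 1)
        d = decompose(sub)
        if d is not None:
            a, shift = d
            total += precomputed[a] << (shift + g * width)
    return total
```

For example, 26 = (011 010) in binary splits into the subnumbers 3 and 2 = 1≪1, so csm_multiply(43, 26) sums (43·3)≪3 and (43·1)≪1 to give 1118.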

Authorized licensed use limited to: Sri Venkateswara College of Engineering and Tech. Downloaded on March 3, 2009 at 21:47 from IEEE Xplore. Restrictions apply.


Fig. 11. Precomputer structure when using carry–sum DNR.

1) Implementation: We implemented a 17 × 17 multiplier, generating output in carry–sum dual number format, in a 0.25-μm technology.

The use of carry–sum dual number representation boosts the speed of the precomputer. In the case of the CSHM, the delay of the precomputer equals several full-adder delays. With carry–sum DNR, the maximum delay of the precomputer is reduced to one full-adder and one inverter delay. Fig. 11 shows the structure of the precomputer. The binary representations of 1, 3, 5, and 9 contain only one or two 1s, so if we generate output in dual number format, no addition is required for these alphabets; Fig. 12(a) shows that shifting alone is enough to implement them. For the remaining alphabets, only one full-adder and one inverter delay is required: the binary representations of 7, 11, 13, and 15 contain three or more 1s, so three operands have to be added for their precomputation. After one full-adder stage, the number of operands is reduced to two, and these two numbers are used as the redundant number outputs. As examples, Fig. 12(b) and (c) show implementations of this kind. It is evident that we can achieve a higher speed and lower power precomputer using DNR.
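The precomputer's use of dual numbers can be modeled arithmetically. In the sketch below (our own model, not the circuit), each precomputation a·x is returned as a (carry, sum) pair whose sum equals a·x: alphabets with one or two 1s need shifts only, and alphabets with exactly three 1s need a single full-adder (3:2) stage; 15 is left out of this simplified model.

```python
# Arithmetic model of dual-number precomputation (not the gate-level design):
# a*x is kept as a (carry, sum) pair, so no carry-propagating addition
# is performed in the precomputer.

def precompute_dnr(x, a):
    """Return a (carry, sum) pair for a*x; x and a nonnegative integers."""
    ones = [i for i in range(a.bit_length()) if (a >> i) & 1]
    if len(ones) == 1:                       # a = 1: shift only
        return 0, x << ones[0]
    if len(ones) == 2:                       # a = 3, 5, 9: a pair of shifts
        return x << ones[1], x << ones[0]
    assert len(ones) == 3                    # a = 7, 11, 13 (15 omitted here)
    t0, t1, t2 = (x << i for i in ones)
    s = t0 ^ t1 ^ t2                         # 3:2 compressor: bitwise sum
    c = ((t0 & t1) | (t1 & t2) | (t0 & t2)) << 1   # carries, weight 2
    return c, s                              # one full-adder stage suffices
```

For instance, precompute_dnr(x, 3) returns the pair (2x, x) with no addition at all.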

Each Select/Shift unit of the CSHM-DNR is composed of two select/shifts of the CSHM and a corrector (see Fig. 13). Each select/shift deals with the C and S of the precomputed values in parallel. The control unit performs the same function as the SHIFT of the CSHM. Zero or shifted C and S are generated by the ISHIFT, depending on the zero and shift signals. However, the two ISHIFTs of the S&S generate different numbers when the zero signal is "1": under the Baugh–Wooley algorithm, a zero input subnumber still corresponds to a nonzero PM sum, so in the implementation ISHIFT_1 and ISHIFT_2 generate the corresponding constants (see Fig. 13).

The corrector generates a correction value; Table I shows the correction values based on the shift signal. The CSTA follows the S&S. Fig. 14 shows the structure of the CSTA. At every adder stage, as many compressors or full-adders as possible are arranged to reduce the number of signals at every bit level. This procedure is repeated until only two signals are left; these two signals are used as the output numbers.
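The reduction procedure can be sketched as follows (a behavioral model with assumed names: a real CSTA works bit slice by bit slice, whereas this model applies the 3:2 compressor identity a + b + c = (a XOR b XOR c) + 2·majority(a, b, c) to whole words).

```python
# Behavioral sketch of a carry-save tree: apply as many 3:2 compressors
# (full adders) per stage as possible until only two operands remain.

def compress_3to2(a, b, c):
    """One full-adder stage on words: (sum, carry) with sum+carry == a+b+c."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1

def carry_save_tree(operands):
    """Reduce nonnegative operands to two numbers preserving the total."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        while len(ops) >= 3:
            s, c = compress_3to2(ops.pop(), ops.pop(), ops.pop())
            nxt += [s, c]
        ops = nxt + ops          # 0-2 leftovers join the next stage
    return (ops + [0, 0])[:2]    # pad so exactly two numbers are returned
```

The two returned numbers play the role of the carry–sum dual number output; a single vector merge addition recovers the conventional result when needed.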

The architecture of the CSHM-TC is easily obtained from the CSHM-SM and the CSHM-DNR. In the implementation of the CSHM-TC, the precomputer computes all required precomputations and provides each as a single number, as in the CSHM-SM. The architecture of the S&S of the CSHM-TC is the same as that of the CSHM-DNR, but each Select/Shift unit of the CSHM-TC includes only one select/shift of the CSHM. Hence, the number of operands fed to the CSTA is the same as in the CSHM-SM.

Table II shows the performance and area of the CSHM-DNR and the CSHM-TC. DNR reduces the delay of the precomputer significantly but degrades the performance of the S&S slightly. As a result, the total delay of the CSHM-DNR decreases by 26% at the cost of increased area. However, if the precomputer is excluded from consideration, the performance of the CSHM-TC is slightly better than that of the CSHM-DNR.

Fig. 15 shows the architecture of an FIR filter based on the CSHM. If conventional multipliers (Wallace-tree multiplier, Booth-encoded multiplier, etc.) are used for the filter implementation, flip-flops for pipelining must be placed in every tap of the FIR filter. However, pipelining of a filter using the CSHM can be done simply by placing flip-flops right after the precomputer, thanks to computation sharing and reuse. Therefore, the cost of pipelining (the number of flip-flops) is much smaller than with conventional multipliers. Because pipelining is allowed in most applications, comparing conventional multipliers against the S&S and adder blocks of the CSHM is considered reasonable.
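A behavioral sketch of this sharing (names and the small integer-coefficient setup are our own assumptions, not the paper's RTL): one precomputer forms each sample's alphabet products, the delay line carries those products, and every tap merely selects and shifts.

```python
# Behavioral model of an FIR filter with a single shared precomputer.
# Coefficients are assumed to be integers of at most groups * width
# magnitude bits; sign handling is done arithmetically for clarity.

ALPHABETS = (1, 3, 5, 7)

def coeff_plan(h, width=3, groups=2):
    """Decompose each coefficient into a sign and (alphabet, shift) terms."""
    plan = []
    for c in h:
        mag, terms = abs(c), []
        for g in range(groups):
            sub = (mag >> (g * width)) & ((1 << width) - 1)
            if sub:
                sh = (sub & -sub).bit_length() - 1   # trailing-zero count
                terms.append((sub >> sh, sh + g * width))
        plan.append((1 if c >= 0 else -1, terms))
    return plan                    # computed once for fixed coefficients

def fir_shared(x, plan):
    """Direct-form FIR whose delay line stores precomputed products."""
    line = [dict.fromkeys(ALPHABETS, 0) for _ in plan]
    y = []
    for xn in x:
        pre = {a: a * xn for a in ALPHABETS}   # the one shared precomputer
        line = [pre] + line[:-1]               # products travel down the taps
        acc = 0
        for (sign, terms), prods in zip(plan, line):
            for a, sh in terms:
                acc += sign * (prods[a] << sh)  # each tap: select & shift only
        y.append(acc)
    return y
```

Note that the per-sample multiplication work (the `pre` dictionary) is done once regardless of the number of taps, which mirrors why only the stage after the precomputer needs pipeline flip-flops.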

V. IMPLEMENTATION OF DFE USING COMPUTATION SHARING MULTIPLIER FOR TWO'S COMPLEMENT NUMBERS

The target application of the CSHM is distributed multiplications, and FIR filtering is one such application. In this section, we present the use of the CSHM for FIR filter implementation. Even though the CSHM can be used for both adaptive and nonadaptive FIR filters, the DFE is selected as an example of an adaptive filter and implemented based on the CSHM.

We consider a minimum mean square error (MMSE) DFE using the LMS algorithm for adaptation. The decision feedback equalization can be expressed as

$$\hat{s}(n) = \sum_{i} c_i(n)\,x_i(n) + \sum_{j} b_j(n)\,\tilde{s}_j(n) \qquad (7)$$

where $c_i(n)$ and $b_j(n)$ are the coefficients of the $i$th and the $j$th filter tap of the feedforward and the feedback filter in the $n$th iteration, and $x_i(n)$ and $\tilde{s}_j(n)$ are the inputs to the $i$th feedforward filter tap and the $j$th feedback filter tap.

$$e(n) = s(n) - \hat{s}(n) \qquad (8)$$

Here, $e(n)$ is the error, defined as the difference between the $n$th transmitted symbol $s(n)$ and its corresponding estimate $\hat{s}(n)$ at the $n$th iteration. In addition, the LMS algorithm [12] is expressed as

$$\mathbf{W}(n+1) = \mathbf{W}(n) + \mu\,e(n)\,\mathbf{X}(n) \qquad (9)$$


Fig. 12. Implementation of precomputer.

where $\mathbf{W}(n)$ is the vector of the equalizer coefficients in the $n$th iteration, $\mathbf{X}(n)$ is the vector of the signal samples stored in the FIR filters in the $n$th iteration, and $\mu$ is the step size parameter that controls the rate of adjustment.

In (7), each sum of weighted inputs describes an FIR filter: one sum of weighted symbols composes the feedforward FIR filter, and the other composes the feedback FIR filter.
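Equations (7)–(9) can be prototyped numerically. The NumPy sketch below is our own illustration (variable names, tap counts, and the training setup are assumptions): it runs a DFE over a known training sequence, forming the output per (7), the error per (8), and the LMS update per (9).

```python
import numpy as np

def dfe_lms(rx, train, n_ff=5, n_fb=5, mu=0.01):
    """LMS-trained DFE sketch: returns estimates and adapted coefficients."""
    c = np.zeros(n_ff)                 # feedforward coefficients c_i(n)
    b = np.zeros(n_fb)                 # feedback coefficients b_j(n)
    fb = np.zeros(n_fb)                # previously decided symbols
    out = []
    for n in range(n_ff - 1, len(rx)):
        x = rx[n - n_ff + 1:n + 1][::-1]   # feedforward tap inputs x_i(n)
        s_hat = c @ x + b @ fb             # (7): equalizer output
        e = train[n] - s_hat               # (8): error vs. known symbol
        c += mu * e * x                    # (9): W(n+1) = W(n) + mu e X(n)
        b += mu * e * fb
        fb = np.roll(fb, 1)
        fb[0] = train[n]                   # feed the decided symbol back
        out.append(s_hat)
    return np.array(out), c, b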


Fig. 13. Select & Shift unit implementation.

TABLE I. CORRECTION VALUES

Fig. 14. Carry–save tree adder structure.

As the data transmission rate goes up, the ISI of the signal becomes severe. As a consequence, the order of the two FIR filters used in the DFE has to increase, and the computational complexity of the filters increases as well. In addition, a higher operating speed of the DFE is required. By using the sharing multiplier presented in the previous sections, we can solve both problems and achieve a high-performance DFE.

TABLE II. DELAY AND AREA OF CSHM FOR TWO'S COMPLEMENT NUMBERS

Fig. 15. FIR filter implemented with CSHM-DNR.

The FIR filter can be implemented in the direct form or in the transposed direct form. For an adaptive filter, the direct-form FIR filter is used. In the DFE, the change of coefficients in the $n$th iteration must be reflected in the output of the feedforward and feedback FIR filters within the same iteration. Thus, any delay component through the feedback loop must be avoided at all costs. Therefore, filter structures with a time delay, such as the transposed direct form, have to be avoided.
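The two structures can be contrasted in a short behavioral sketch (our own illustration): both produce identical outputs for a fixed filter, but the transposed form's partial sums pass through delay registers, which is what rules it out inside the DFE's adaptation loop.

```python
# Behavioral comparison of the two FIR structures (fixed coefficients).

def fir_direct(x, h):
    """Direct form: y[n] = sum_k h[k] * x[n-k]."""
    taps = [0.0] * len(h)
    y = []
    for xn in x:
        taps = [xn] + taps[:-1]        # shift the sample into the delay line
        y.append(sum(hk * tk for hk, tk in zip(h, taps)))
    return y

def fir_transposed(x, h):
    """Transposed direct form (len(h) >= 2): partial sums cross the delays."""
    regs = [0.0] * (len(h) - 1)
    y = []
    for xn in x:
        y.append(h[0] * xn + regs[0])  # output mixes old partial sums
        for i in range(len(regs) - 1):
            regs[i] = h[i + 1] * xn + regs[i + 1]
        regs[-1] = h[-1] * xn
    return y
```

In the transposed form, a coefficient change takes several samples to fully reach the output because the stored partial sums were formed with the old coefficients; the direct form applies the newest coefficients to all stored samples immediately.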

If we use the CSHM for the FIR filter design, the precomputer sits at the input, and all the multipliers in the filter are replaced with the S&S and adder blocks. Only one precomputer is required because all the S&S blocks share the precomputation results of a single precomputer.

Fig. 16 describes the detailed architecture of the DFE, and Fig. 17 shows the implementation of the LMS algorithm. We replace the multipliers of the feedforward and feedback filters with the CSHM for two's complement numbers and the precomputer. The computation of the precomputer is shared by all S&Ss in each filter. The carry–save tree adder following the S&Ss adds all their outputs and generates the result as a carry–sum dual number. Five-tap FIR filters are used for the feedforward and feedback filters in the implementation, giving 20 outputs from the filter taps in total. In the implementation, the accumulators of each FIR filter are integrated into one large carry–save tree adder. The integrated accumulator also generates output in dual number format. We insert the vector merger adder, which computes the final result, only where it is required.

VI. RESULTS

We present the simulation results for the multipliers and the DFE. For comparison, different multipliers (Wallace-tree multiplier, Booth-encoded multiplier, CSHM-TC, and CSHM-DNR) are implemented, and DFEs are implemented based on each of them. The


Fig. 16. DFE implementation with CSHM using dual number representation.

Fig. 17. Implementation of LMS algorithm.

post-layout simulation results show that the proposed multipliers are more efficient than conventional multipliers when applied to distributed multiplications with a common input. Because the FIR filtering operation is a good example to show this effect, we take the DFE, which includes two FIR filters, as the example application. PowerMill and PathMill are used for the power and delay estimation, respectively.

Fig. 18 summarizes the performance (speed) and area of the different multipliers (in isolation). Depending on the application, increased latency may not be acceptable. Hence, in the case of the CSHM, two kinds of delay and area results per multiplier are

Fig. 18. Performance and area comparison between multipliers.

computed. If the precomputer is included in the multiplier delay, there is no benefit to using the CSHM in terms of delay and area. However, if the application allows the CSHM to be pipelined by placing flip-flops right after the precomputer, the speeds of the CSHM-TC and CSHM-DNR improve by approximately 33% and 23% over that of the Booth-encoded multiplier, respectively, at the cost of increased area (CSHM-TC: 6%, CSHM-DNR: 66%). The performance improvement comes for two reasons. One is the reduction in the number of operands: because the carry–save tree adder accounts for a large portion of the multiplier delay, a smaller number of operands is desirable for better performance. In the case of the CSHM-TC, the number of operands is only half that of the Booth-encoded multiplier, resulting in reduced delay. In addition, the carry–sum dual number representation shows better performance by removing the carry propagation delay required by the VMA. Of course, pipelining can be considered one of the


Fig. 19. Performance and area comparison between DFEs using different multipliers.

Fig. 20. Power and PDP of DFEs.

main reasons for the performance improvement. However, we will show that in applications involving distributed multiplications, the CSHM provides a better architecture for pipelining, at a lower cost than conventional multipliers.

Fig. 19 shows the speed and area of the DFEs implemented using the different multipliers (Wallace, Booth-encoded, CSHM-TC, and CSHM-DNR). The DFEs based on the CSHM-TC and CSHM-DNR can operate approximately 42% and 34% faster than the DFE based on the Wallace-tree multiplier. Compared with the DFE implemented with a Booth-encoded multiplier, the DFEs using the CSHM show comparable or slightly improved speed. Because the critical path of the DFE does not include the precomputer, the DFE using the CSHM-TC shows better performance than the one using the CSHM-DNR.

Fig. 20 describes the measured power consumption of the DFEs, both at maximum performance and at the same clock period of 55 ns. At maximum performance, the DFE using the CSHM-DNR consumes 10% more power than the DFE using the Booth-encoded multiplier. On the other hand, the DFE using the CSHM-TC consumes 10% less power than the DFE based on the Booth-encoded multiplier. The power-delay product (PDP) of the DFE using the CSHM-TC improves by 29% over the DFE using the Booth-encoded multiplier, while the PDP of the DFE using the CSHM-DNR is almost the same as that of the DFE using the Booth-encoded multiplier. Consider the architectures of the CSHM-DNR and the Booth-encoded multiplier: because we use the carry–sum DNR for the CSHM-DNR, the number of operands that must be added in the final carry–save tree adder is almost the same as in the Booth-encoded multiplier; the difference is the sharing of precomputation between several multiplications. Hence, the overhead of

Fig. 21. Power estimation of FIR filter.

Fig. 22. Power consumption of multiplier at 55 ns CLK.

the precomputer and latches conceals the benefit of the sharing multiplier in small filters. However, as the number of filter taps increases, the benefit of the sharing multiplier increases (Fig. 21). If the DFEs are operated at the same clock period (55 ns, the minimum clock period of the DFE using the Wallace multiplier), the CSHM-DNR and CSHM-TC reduce the power dissipation of the DFE by 7% and 30%, respectively, compared with the DFE using the Booth-encoded multiplier.

Fig. 22 shows the power consumption of the multipliers (in isolation) at a 55-ns clock period. Excluding the precomputer from consideration, the CSHM-DNR and CSHM-TC show 17% and 40% improvements in power dissipation over the Booth-encoded multiplier. However, if we include the precomputer, the power consumption of the CSHM is larger than that of the Booth-encoded multiplier. This explains why the computation sharing multiplier is more suitable for applications involving distributed multiplications. As a result, the benefit of the CSHM is magnified in applications such as FIR filtering, matrix multiplication, etc.

VII. SUMMARY

In this paper, we presented an architecture of the computation sharing multiplier for two's complement numbers, which exploits computation reuse and reduces the computational complexity of the filtering operation. A decision feedback equalizer was implemented based on the proposed multiplier using carry–sum dual number representation. The DFE implemented with the CSHM and dual number representation improves


performance by 10% over the DFE using the Booth-encoded multiplier. On average, an 18% improvement in power consumption is also obtained with the CSHM at a 55-ns clock.

REFERENCES

[1] Y. C. Lim and S. R. Parker, "FIR filter design over a discrete powers-of-two coefficient space," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-31, pp. 583–591, June 1983.

[2] K. Hwang, Computer Arithmetic: Principles, Architecture, and Design. New York: Wiley, 1979.

[3] Q. Zhao and Y. Tadokoro, "A simple design of FIR filters with powers-of-two coefficients," IEEE Trans. Circuits Syst., vol. 35, pp. 566–570, May 1988.

[4] R. Pasko et al., "A new algorithm for elimination of common subexpressions," IEEE Trans. Comput.-Aided Des. Integrated Circuits Syst., vol. 18, pp. 58–68, Jan. 1999.

[5] H. Samueli, "An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients," IEEE Trans. Circuits Syst., vol. 36, pp. 1044–1047, July 1989.

[6] M. Potkonjak, M. B. Srivastava, and A. P. Chandrakasan, "Multiple constant multiplications: Efficient and versatile framework and algorithms for exploring common subexpression elimination," IEEE Trans. Comput.-Aided Des. Integrated Circuits Syst., vol. 15, pp. 151–165, Feb. 1996.

[7] K. Muhammad, "Algorithmic and architectural techniques for low power digital signal processing," Ph.D. dissertation, Purdue Univ., West Lafayette, IN, 1999.

[8] J. G. Proakis, Digital Communications, 3rd ed. New York: McGraw-Hill, 1995.

[9] ——, "Adaptive equalization for TDMA digital mobile radio," IEEE Trans. Veh. Technol., vol. 40, pp. 333–341, May 1991.

[10] G. D. Forney, "Maximum likelihood sequence estimation of digital sequences in the presence of intersymbol interference," IEEE Trans. Inform. Theory, vol. IT-18, pp. 363–378, May 1972.

[11] C. A. Belfiore and J. H. Park, "Decision feedback equalization," Proc. IEEE, vol. 67, pp. 1143–1156, Aug. 1979.

[12] S. Haykin, Adaptive Filter Theory, 3rd ed. Englewood Cliffs, NJ: Prentice-Hall, 1996.

[13] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Englewood Cliffs, NJ: Prentice-Hall, 1994.

[14] C. R. Baugh and B. A. Wooley, "A two's complement parallel array multiplication algorithm," IEEE Trans. Comput., vol. C-22, pp. 1045–1047, Dec. 1973.

[15] S. S. Nayak and P. K. Meher, "High throughput VLSI implementation of discrete orthogonal transforms using bit-level vector–matrix multiplier," IEEE Trans. Circuits Syst. II, vol. 46, pp. 655–658, May 1999.

Hunsoo Choo received the B.S. degree in electrical engineering from Yonsei University, Seoul, Korea, in 1998 and the M.S. degree in electrical and computer engineering from Purdue University, West Lafayette, IN, in 2000. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering at Purdue University.

His main research interests include high-level synthesis techniques for low-complexity and low-power design and low-power VLSI design of multimedia wireless communications and signal processing systems.

Khurram Muhammad received the B.Sc. degree from the University of Engineering and Technology, Lahore, Pakistan, in 1990, the M.Eng.Sc. degree from the University of Melbourne, Parkville, Australia, in 1993, and the Ph.D. degree from Purdue University, West Lafayette, IN, in 1999, all in electrical engineering.

From 1990 to 1991 and 1993 to 1994, he worked in the research and development section of Carrier Telephone Industries, Islamabad, Pakistan, where he developed board-level designs for rural and urban telecommunication. He also worked for the G.I.K. Institute of Engineering Sciences and Technology. In 1995, he developed fast simulation techniques for DS/CDMA systems in a multipath fading environment at the Hong Kong University of Science and Technology, Kowloon, Hong Kong. Since 1999, he has been working at Texas Instruments, Dallas, TX, first in the advanced read channel development group and, currently, in communication and control products, where he supports digital and mixed-signal design in home networking products. His main research interests include CMOS mixed-signal VLSI design, advanced CAD techniques exploring digital design tradeoffs at the circuit and system levels, and low-complexity and low-power design.

Kaushik Roy (F'02) received the B.Tech. degree in electronics and electrical communications engineering from the Indian Institute of Technology, Kharagpur, India, and the Ph.D. degree from the Electrical and Computer Engineering Department, University of Illinois at Urbana-Champaign, in 1990.

He was with the Semiconductor Process and Design Center, Texas Instruments, Dallas, TX, where he worked on FPGA architecture development and low-power circuit design. He joined the electrical and computer engineering faculty at Purdue University, West Lafayette, IN, in 1993, where he is currently a Professor. His research interests include VLSI design/CAD with particular emphasis on low-power electronics for portable computing and wireless communications, VLSI testing and verification, and reconfigurable computing. He has published more than 200 papers in refereed journals and conferences, holds five patents, and is a co-author of a book on Low Power CMOS VLSI Design (New York: Wiley).

Dr. Roy received the National Science Foundation Career Development Award in 1995, the IBM Faculty Partnership Award, the AT&T/Lucent Foundation Award, and best paper awards at the 1997 International Test Conference and the 2000 International Symposium on Quality of IC Design, and he is currently a Purdue University Faculty Scholar Professor. He is on the editorial boards of IEEE Design and Test, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, and the IEEE TRANSACTIONS ON VLSI SYSTEMS. He was Guest Editor for the Special Issue on Low-Power VLSI in IEEE Design and Test in 1994 and in the IEEE TRANSACTIONS ON VLSI SYSTEMS in June 2000.
