
Microelectronics Journal 42 (2011) 1343–1352


ASIC design of a high speed low power circuit for factorial calculation using ancient Vedic mathematics

P. Saha a, A. Banerjee b, A. Dandapat c, P. Bhattacharyya d,*

a School of VLSI Technology, Bengal Engineering and Science University, Shibpur, Howrah 711103, West Bengal, India
b Department of Electronics and Communication Engineering, JIS College of Engineering, Kalyani 741235, India
c Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata 700032, India
d Department of Electronics and Telecommunication Engineering, Bengal Engineering and Science University, Shibpur, Howrah 711103, West Bengal, India

Article info

Article history:
Received 28 January 2011
Received in revised form 2 September 2011
Accepted 5 September 2011
Available online 29 September 2011

Keywords:

Vedic multiplier

Incrementer

Zero detectors

Decrementer

Factorial design

High speed.

0026-2692/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.
doi:10.1016/j.mejo.2011.09.001

* Corresponding author. Tel.: +91 3326684561; fax: +91 3326682916.
E-mail addresses: [email protected] (P. Bhattacharyya), [email protected] (P. Saha), banerjee.arindam1@gmail.com (A. Banerjee), [email protected] (A. Dandapat).

Abstract

ASIC design of a high speed, low power circuit for calculating the factorial of a number is reported in this paper. The factorial of a number can be calculated by iterative multiplication combined with an incrementing or decrementing process, and the iterative multiplication can be computed through a parallel implementation methodology. Parallel implementation together with Vedic multiplication ensures a significant reduction in propagation delay and switching power consumption, owing to the reduced number of stages in the multiplication process, in comparison with conventionally used Vedic multiplication methodologies such as ‘Urdhva-tiryakbyham’ (UT) and ‘Nikhilam Navatascaramam Dasatah’ (NND) based implementations. Transistor level implementation was carried out using Spice Specter with standard 90 nm CMOS technology, and the results were compared with the above mentioned conventional methodologies. The propagation delay for the factorial calculation of a 4-bit number was only ~42.13 ns, while the power consumption was ~58.82 mW for a layout area of ~6 mm². The improvement in speed was found to be ~33% and ~24%, with a corresponding reduction in power consumption of ~34.48% and ~24%, in comparison with UT and NND based implementations, respectively.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

ASIC implementation of logarithmic, exponential, trigonometric and other arithmetic circuits plays a pivotal role in the field of general and special purpose computers [1,2]. Generally, such computations are implemented through software programs using Newton–Raphson, Taylor–MacLaurin series, or polynomial approximations. A factorial computation circuit is therefore of immense importance for ASIC implementation of such series (Newton–Raphson, Taylor–MacLaurin series, or polynomial approximations).

The principal components required for hardware implementation of factorial calculation circuitry are an incrementer/decrementer and a multiplier for successive multiplication. The successive multiplication and the incrementer/decrementer therefore limit the overall speed of the factorial implementation. A substantial amount of work has so far been reported on multipliers [3–10], such as the shift and add multiplier, tree multiplier, array multiplier, signed digit multiplier, etc., to improve the operating speed and power consumption. Common multiplication is done using shift and add operations [3], where a sequential mechanism is used, producing a large propagation delay. In parallel multipliers, the partial products are generated through Booth’s encoding [4] techniques and added with the help of parallel adders; the generation and addition stages therefore limit the overall speed of the parallel multiplier [5]. To reduce the number of partial products, the modified Booth’s algorithm [6] is one of the most popular mechanisms, while to improve speed the Wallace Tree algorithm [7–9], which reduces the number of sequential addition stages, can be incorporated. Another solution for partial product addition was reported by Wang in 1995, where compressors [10] are used in the partial product addition stages, which reduces the carry propagation significantly. The Canonical Signed Digit (CSD) multiplier [11–13] is an alternative solution for fast multiplication, but the procedure for generating the partial products is the same and a large number of pre- and post-processing units are required for binary to CSD conversion. However, with increasing parallelism, the number of shifts between the partial products and the intermediate sums to be added increases, which may result in reduced speed, increased silicon area due to irregularity of structure, and increased power consumption due to the additional interconnections resulting from complex routing.

A significant amount of research has so far been published on incrementers/decrementers, with the aim of improving the speed of operation, using a program counter [14] and a frequency divider [15]. The bottleneck of these methods is the incorporation of a sequential mechanism to implement the design. An incrementer/decrementer can also be implemented using an adder/subtractor [16], but its major drawback is a low operating speed due to long carry propagation from LSB to MSB [17]. In this work, to circumvent the above drawbacks, a multiple look-ahead and dynamic circuitry based incrementer/decrementer [17] has been adopted, and circuit level implementation using multiplexers [18] has been carried out for the high speed factorial circuitry.

At the algorithmic and structural levels, many multiplication techniques have been developed to enhance the performance of the multiplier by reducing the partial products and the subsequent addition stages. Vedic Mathematics [19] is the ancient system of Indian mathematics, which has a unique technique of calculation based on 16 Sutras (formulae). The “Urdhva-tiryakbyham” formula (a Sanskrit term meaning ‘vertically and crosswise’) is used for multiplication of smaller numbers. A few research papers [20–23] have so far been published using the “Urdhva-tiryakbyham” formula, aiming for fast multiplication. However, Mehta and Gwali [24] explored multiplication using the “Urdhva-tiryakbyham” sutra and pointed out its carry propagation issues.

Likewise, a multiplier design using the “all from 9 and last from 10” formula (“Nikhilam Navatascaramam Dasatah” sutra) was reported by Tiwari et al. [25] in 2009, but without any hardware module implementation at the circuit level. Very recently, we [26] reported a general multiplier based on the same principle of the NND sutra, but a specific application such as the parallel multiplication methodology was not explored. In this work, two main formulae of ancient “Vedic mathematics”, “Urdhva-tiryakbyham (UT)” and “Nikhilam Navatascaramam Dasatah (NND)”, were used in the multiplication process to facilitate the factorial circuitry implementation. In this approach, an (N×N) bit multiplier implementation was transformed into just one small multiplication (bit length ≪N) and one adder/subtractor implementation. The reported multiplication methodology was compared with previously reported Vedic mathematical architectures [20,26]. It was observed that the proposed design not only produced fewer partial products but also exhibited a regular structure, thereby leading to an optimized layout area.

As the factorial of a number is the product of all the positive integers up to it, the factorial can be calculated by iteratively multiplying the given number by its decremented values down to ‘1’, or by iteratively multiplying upward from ‘1’ to the number. On account of the recursive multiplication, the overall factorial circuitry exhibits an irregular array architecture, leading to a large layout area and higher propagation delay. In this work, this problem was resolved by the parallel implementation methodology. The resulting multiplication procedure was applied to high speed factorial computation circuitry, and the results were compared with other Vedic multiplication [20,26] based implementations. The overall factorial circuitry and all the individual circuit modules, such as the incrementer/decrementer, zero detector, multipliers of different bit lengths, etc., were simulated, and performance parameters such as propagation delay and dynamic switching power consumption were calculated through a Spice simulator in 90 nm CMOS technology. It was found that the factorial circuitry for a 4-bit number consumed ~58.82 mW of power with a delay of ~42.13 ns for a layout area of ~6 mm².

2. Algorithm for factorial computation

The factorial of a number is the product of all the positive integers less than or equal to it. It can be calculated by iteratively multiplying the given number by its successively decremented values, or by iteratively multiplying upward from ‘1’ to the number. Zero is not included in the product; the factorial of zero is defined as 1. Mathematically, the factorial of the input number can be written as:

$$n! = \begin{cases} 1 & n = 0 \\ \displaystyle\prod_{k=1}^{n} k & n \geq 1 \end{cases} \tag{1}$$

2.1. Mathematical representation

Consider an n-bit number X. Then X can be represented as

$$X = 2^{k} - z \tag{2}$$

where k is the exponent of X and z is the residue. The next input to the multiplier would be either the incremented or the decremented value of X. Assume the next input is equal to X ± 1:

$$X \pm 1 = 2^{k} - z \pm 1 \tag{3}$$

Then the factorial of a number can be computed either by decrementing down to ‘1’ with iterative multiplication, or by incrementing from ‘1’ up to the number with iterative multiplication. Mathematically, the formula can be represented as

$$\text{Result} = P_I = \begin{cases} 1 & \text{for } n = 0 \\ \displaystyle\prod_{k=1}^{n} k & \text{for } n \geq 1 \end{cases} \tag{4}$$

From Eq. (4), the general expression of the product term after the nth iteration is equal to $P_n$.

Mathematically, $P_I$ can be formulated as:

$$P_I = 2^{k} P_{I-1} - z P_{I-1} \pm I\, P_{I-1} \tag{5}$$

Here I is the number of iterations to be executed to calculate the factorial of a number. Proof of Eq. (5):

Assume I = 1:

$$P_1 = X(X \pm 1) = (2^{k} - z)(2^{k} - z \pm 1) = 2^{2k} - 2^{k} \cdot 2z + z^{2} \pm (2^{k} - z) \tag{6}$$

$$P_1 = 2^{k}(X - z) + z^{2} \pm X \tag{7}$$

Assume I = 2. Now the second operand, Y, is again incremented or decremented by one, so Y is replaced by its new value:

$$P_2 = P_1 (X \pm 2) \tag{8}$$

$$P_2 = 2^{k} P_1 - z P_1 \pm 2 P_1 \tag{9}$$
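The recurrence of Eq. (5) can be checked numerically with a minimal Python sketch. The function name and the choice of starting value P_0 = X are assumptions made for illustration; the residue z follows Eq. (2), and `sign` selects the decrementing (−1) or incrementing (+1) form.

```python
def product_by_recurrence(x: int, k: int, iterations: int, sign: int) -> int:
    """Apply Eq. (5) repeatedly, starting from P_0 = X (an assumption).

    x          : starting operand X
    k          : exponent, so that X = 2**k - z (Eq. (2))
    iterations : number of iterations I to execute
    sign       : -1 for the decrementing form, +1 for the incrementing form
    """
    z = (1 << k) - x              # residue z from X = 2^k - z
    p = x                         # P_0 = X
    for i in range(1, iterations + 1):
        # P_I = 2^k * P_{I-1} - z * P_{I-1} +/- I * P_{I-1}
        p = (1 << k) * p - z * p + sign * i * p
    return p


# Decrementing from X = 5 (2^3 - 3): 5 * 4 * 3 * 2 * 1
print(product_by_recurrence(5, 3, 4, -1))
# Incrementing from X = 1 (2^1 - 1): 1 * 2 * 3 * 4 * 5
print(product_by_recurrence(1, 1, 4, +1))
```

Each step multiplies the running product by (X ± I), which is exactly what the three terms of Eq. (5) add up to.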

The flow chart of the algorithm for computing the factorial of a number is shown in Fig. 1. In this diagram the result is initialized to ‘1’, because the factorial of ‘0’ is defined as ‘1’. First the input number is checked by a zero detector circuit. If the input number is equal to ‘0’, the result ‘1’ is displayed directly (without entering the actual factorial calculation circuit, which leads to lower power consumption); otherwise the input number is fed to the inner loop. Basically, there are two ways to compute the factorial of a number: (i) decrementing from that number with recursive multiplication; and (ii) incrementing from ‘1’ up to that number with recursive multiplication. An ‘UpDown’ signal selects between the two. If the ‘UpDown’ signal is low, the circuit follows the decrement and iterative multiplication process; otherwise it follows the increment and iterative multiplication process. The iteration counter (i) is initialized to ‘1’ for the incrementing process and to the input number for the decrementing process. A comparator circuit checks the iteration counter: whether it is equal to the input number for the incrementing process, or greater than ‘0’ for the decrementing process.
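The control flow described for Fig. 1 can be sketched in software as follows. This is only an illustrative model of the flow chart, not the circuit itself; the function and parameter names are hypothetical.

```python
def factorial_flow(num: int, up_down: bool) -> int:
    """Software model of the flow chart: zero detector, counter, comparator.

    up_down = False -> decrement from num down to 1
    up_down = True  -> increment from 1 up to num
    """
    result = 1                    # result register initialized to '1'
    if num == 0:                  # zero detector: bypass the multiplier loop
        return result
    if up_down:
        i = 1                     # counter starts at '1' when incrementing
        while True:
            result *= i
            if i == num:          # comparator: counter equals the input
                break
            i += 1
    else:
        i = num                   # counter starts at the input when decrementing
        while i > 0:              # comparator: counter still greater than '0'
            result *= i
            i -= 1
    return result


print(factorial_flow(5, up_down=True), factorial_flow(5, up_down=False))
```

Both directions give the same product; they differ only in the counter initialization and in the comparison the comparator performs.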

To compute the factorial of a number, the hardware consists of four parts: (i) zero detector, (ii) decrementer/incrementer, (iii) comparator, and (iv) multiplier. The main function of the zero detector is to check whether the input value is zero. If the input number is zero, the output of the zero detector is promoted to the final result, i.e. ‘1’; otherwise the input passes through the recursive multiplication procedure. The iterative multiplication can be implemented in two ways: by a decrementing process or by an incrementing process. Assume the input is a 4-bit number. The first time, the input is either incremented or decremented, and the input number and its incremented/decremented value are multiplied to produce an 8-bit output. The incremented or decremented value is then incremented/decremented a second time and multiplied by the 8-bit output of the multiplier block. Thus an 8×4 bit multiplier is required, producing a 12-bit output. The procedure continues until the last iteration.

Fig. 1. Flow chart representation of algorithm for calculation of factorial of an integer number.

Fig. 2. Multiplication methodology for modified factorial computation.

2.2. Drawback of existing algorithm

The bottleneck of the existing algorithm is that, due to the recursive multiplication, the length of the multiplier increases with each iteration, leading to excessive hardware complexity, which in turn results in excessive delay and large power consumption. Moreover, only one multiplication can take place at a time, which further increases the overall propagation delay. To solve these problems, a parallel implementation methodology has been adopted in the proposed circuit, which reduces the number of multiplication stages.

2.3. Modified factorial calculation algorithm

In this section, the factorial is computed in a parallel manner, leading to high speed operation. The pseudo-code for the modified algorithm is given below, where the input number is denoted Num. Arr1[i] stores the initial values, and Arr2[j], Arr3[k], Arr4[l], and Arr5[m] store the multiplication values of the second and higher stages, respectively.
Fact(Num)
    for each i from 0 to Num-1
        Arr1[i] = i+1
    end for
    for each i from Num to 15
        Arr1[i] = 1
    end for
    for each j from 0 to 7
        Arr2[j] = Arr1[2*j] * Arr1[2*j+1]
    end for
    for each k from 0 to 3
        Arr3[k] = Arr2[2*k] * Arr2[2*k+1]
    end for
    for each l from 0 to 1
        Arr4[l] = Arr3[2*l] * Arr3[2*l+1]
    end for
    for each m from 0 to 0
        Arr5[m] = Arr4[2*m] * Arr4[2*m+1]
    end for
    return Arr5[0]
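A runnable rendering of the pseudo-code above is sketched below in Python. The function name and the returned stage count are additions for illustration; the register count of 16 matches the pseudo-code for a 4-bit input.

```python
def fact_tree(num: int, width: int = 16):
    """Pairwise (tree) multiplication of `width` registers.

    Registers 0..num-1 hold 1..num; the remaining registers hold 1.
    Returns (factorial, number_of_multiplication_stages).
    """
    if not 0 <= num <= width:
        raise ValueError("num must fit in the register array")
    # Arr1: initial register contents
    level = [i + 1 if i < num else 1 for i in range(width)]
    stages = 0
    while len(level) > 1:
        # Arr2, Arr3, ...: each stage multiplies adjacent pairs in parallel
        level = [level[2 * j] * level[2 * j + 1] for j in range(len(level) // 2)]
        stages += 1
    return level[0], stages


value, stages = fact_tree(5)
print(value, stages)
```

With 16 registers the loop runs four times (16 → 8 → 4 → 2 → 1 products), which corresponds to the four multiplication stages of the modified algorithm.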

Fig. 2 shows the schematic representation of the above mentioned hardware implementation methodology, where all the input registers are initially filled with “0001”. A parallel implementation methodology has been adopted for the computation; as a result, the factorial of a 4-bit number can be calculated within only 4 stages, giving a significant reduction in propagation delay. Vedic multiplication methodologies have been implemented for the multiplication.

3. Vedic mathematics for multiplication

The potential of ‘Vedic Mathematics’, especially for calculations involving multiplication, was reported by ‘Sri Bharati Krsna Thirthaji Maharaja’ in the form of Vedic Sutras (formulae) [19]. He explored the mathematical potential of the Vedic primers and showed that mathematical operations can be carried out mentally to produce fast answers using these ‘Sutras’. In this paper, only the ‘NND’ and ‘UT’ formulae are used.

3.1. “Nikhilam Navatascaramam Dasatah” (NND) sutra

‘NND’ means ‘all from 9 and last from 10’. The formula is applicable to the implementation of a multiplier. Using this methodology, an (N×N) multiplier is transformed into an addition/subtraction and a small (≪N) multiplication, which reduces carry propagation and leads to high speed operation. A simple example will suffice to clarify the operations:

As shown in Fig. 3, the multiplier and the multiplicand are written in two rows, followed by the differences of each of them from the chosen base, so that there are two columns of numbers, one consisting of the numbers to be multiplied (Column 1) and the other consisting of their complements (Column 2). The product also consists of two parts, which are divided by a vertical line for the purpose of illustration. The right hand side (RHS) of the product is obtained by simply multiplying the numbers of Column 2 (2 × 3 = 6). The left hand side (LHS) of the product is found by cross subtraction of the second number of Column 2 from the first number of Column 1 or vice versa, i.e., 998 − 003 = 995 or 997 − 002 = 995. The final result is obtained by concatenating LHS and RHS, i.e., Answer = 995006. The mathematical description of this sutra can be formulated as follows.

Fig. 3. Implementation of multiplication using “NND” sutra.

Assume A and B are two numbers to be multiplied and their product is equal to P. Mathematically, A and B can be expressed as

$$A = \sum_{i=0}^{n-1} A_i 10^{i} \quad \text{and} \quad B = \sum_{i=0}^{n-1} B_i 10^{i}, \quad \text{where } A_i, B_i \in \{0, 1, \ldots, 9\} \tag{10}$$

The multiplication rule can be written as

$$P = AB \tag{11}$$

Eq. (11) can be reformulated by adding and subtracting the term $10^{2n} + 10^{n}(A+B)$ on the right hand side:

$$P = AB + 10^{2n} + 10^{n}(A+B) - 10^{2n} - 10^{n}(A+B) \tag{12}$$

$$P = \{10^{n}(A+B) - 10^{2n}\} + 10^{2n} - 10^{n}(A+B) + AB \tag{13}$$

Eq. (13) holds in both cases, whether the numbers are greater than the base or less than the base:

$$P = \begin{cases} 10^{n}(A + B - 10^{n}) + (10^{n} - A)(10^{n} - B) & \text{if } A, B > 10^{n} \\ 10^{n}(A - \bar{B}) + \bar{A}\bar{B} & \text{if } A, B < 10^{n} \end{cases} \tag{14}$$

where n is any positive integer and $\bar{A}$ and $\bar{B}$ are the $10^{n}$’s complements of A and B. The mathematical expression of the “Nikhilam Navatascaramam Dasatah” sutra for the binary number system is given hereunder:

Consider two n-bit numbers X and Y, where k is the exponent and z1, z2 are the residues of X and Y, respectively. Mathematically, X and Y can be represented as $X = 2^{k} \pm z_1$, $Y = 2^{k} \pm z_2$.

The product of X and Y is denoted P and can be represented as:

$$P = X \times Y = (2^{k} \pm z_1)(2^{k} \pm z_2) \tag{15}$$

For fast multiplication using the extended rule of the sutra, the bases of the multiplicand and the multiplier are assumed to be the same, so Eq. (15) can be rewritten as

$$P = XY = 2^{k}(X \pm z_2) \pm z_1 z_2 \tag{16}$$

From Eq. (16) it is observed that a large multiplication can easily be decomposed into a small multiplication, an addition/subtraction and a shift, leading to a reduction in hardware cost, propagation delay and power consumption. The small multiplication can easily be implemented using the “Urdhva-tiryakbyham” sutra (formula).
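As a quick numerical illustration of Eq. (16) and of the decimal example of Fig. 3, the sketch below treats the residues as signed quantities (z = operand − base), which folds the ± signs into the residues themselves. The function name and this signed-residue convention are assumptions made for illustration.

```python
def nikhilam_multiply(x: int, y: int, base: int) -> int:
    """Multiply x and y around a common base (10**n or 2**k).

    With signed residues z1 = x - base and z2 = y - base:
        x * y = base * (x + z2) + z1 * z2
    """
    z1 = x - base                 # signed residue of x
    z2 = y - base                 # signed residue of y
    # One "cross" addition, one scaling by the base (a shift when the
    # base is a power of two), and one small multiplication z1 * z2.
    return base * (x + z2) + z1 * z2


print(nikhilam_multiply(998, 997, 1000))   # decimal example: LHS 995, RHS 006
print(nikhilam_multiply(14, 13, 16))       # binary case with 2^k = 16
```

The only true multiplication left is z1 · z2, whose operands are small when x and y lie close to the base; this is the reduction that Eq. (16) expresses.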

3.2. ‘‘Urdhva-tiryakbyham’’ (UT) sutra

The meaning of the term ‘UT’ is “vertically and crosswise”, and it is applicable to all multiplication operations. This procedure is simply known as the array multiplication technique [24]. The mathematical expression of ‘UT’ for the binary number system is given below.

Assume two N-bit words X and Y are described as

$$X = \sum_{i=0}^{N-1} x_i 2^{i} \tag{17}$$

and

$$Y = \sum_{j=0}^{N-1} y_j 2^{j} \tag{18}$$

where $x_i, y_j \in \{0, 1\}$. Their multiplication can be described as

$$P = XY = \sum_{i=0}^{N-1} x_i 2^{i} \sum_{j=0}^{N-1} y_j 2^{j} \tag{19}$$

$$P = \sum_{i} \sum_{j} x_i y_j 2^{i+j} \tag{20}$$

Let k = i + j:

$$P = \sum_{k=0}^{2N-1} \sum_{i=0}^{N-1} x_i y_{k-i} 2^{k} \tag{21}$$

$$P = \sum_{k=0}^{2N-1} p_k 2^{k} \tag{22}$$

where

$$p_k = \sum_{i=0}^{N-1} x_i y_{k-i} \tag{23}$$

with $y_{k-i}$ taken as 0 whenever k − i lies outside the range 0 to N − 1.
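Eqs. (21)–(23) describe the product as a convolution of the two bit sequences. A minimal Python sketch of that view is given below; the function names are hypothetical, and the column sums are recombined with ordinary integer arithmetic rather than with the adders and compressors of the hardware.

```python
def ut_column_sums(x: int, y: int, n_bits: int) -> list:
    """Return the column sums p_k of Eq. (23) for two n_bits-wide operands."""
    xb = [(x >> i) & 1 for i in range(n_bits)]
    yb = [(y >> j) & 1 for j in range(n_bits)]
    p = [0] * (2 * n_bits)
    for k in range(2 * n_bits):
        for i in range(n_bits):
            j = k - i
            if 0 <= j < n_bits:           # y_{k-i} is 0 outside this range
                p[k] += xb[i] & yb[j]     # crosswise partial product
    return p


def ut_multiply(x: int, y: int, n_bits: int) -> int:
    """Recombine the column sums as in Eq. (22): P = sum_k p_k * 2^k."""
    return sum(pk << k for k, pk in enumerate(ut_column_sums(x, y, n_bits)))


print(ut_column_sums(0b1011, 0b0110, 4))
print(ut_multiply(0b1011, 0b0110, 4))
```

Each p_k may exceed 1, so in hardware the columns still have to be reduced with carries; the sketch leaves that step to the integer addition in `ut_multiply`.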

4. Circuit modules and complete factorial design circuit

The advantages of CMOS transmission gate (TG) logic over conventional CMOS and CPL [26,27] logic are well established. As the CMOS transmission gate consists of one PMOS and one NMOS connected in parallel, its ON resistance is smaller than that of even a single NMOS. The circuit modules required for computing the factorial of a number are described in the following subsections. All the circuit modules have been implemented using TG. Sections 4.1–4.8 describe the operation of the modules, and the complete design of the factorial calculation circuit is described in Section 4.9.

4.1. Zero detectors

Consider an n-bit number given as $X = x_{n-1}, x_{n-2}, \ldots, x_2, x_1, x_0$. The zero detector circuit identifies whether the input number is zero and, if it is, sets the LS bit of the number to logic ‘1’. This implementation handles the computation of the factorial of ‘0’. The hardware implementation of the zero detector is shown in Fig. 4, where X is the input and the output is represented as Y. The Boolean equations for the zero detector are derived from Fig. 4:

$$\text{Control} = x_3 + x_2 + x_1 + x_0 = \text{Ctrl} + x_0, \quad \text{where } \text{Ctrl} = x_3 + x_2 + x_1 \tag{24}$$

$$Y_3 = \text{buffered } x_3 \tag{25}$$

$$\vdots$$

$$Y_0 = \overline{\text{Ctrl}} + x_0 \tag{28}$$

In the zero detector circuit, one ‘Control’ signal is generated, which indicates whether the input number is zero. This signal is fed to the incrementer circuit and determines whether an incrementing operation is required. When all the input bits are zero, ‘Control’ = 0 and no incrementing operation is done. When the input bits are non-zero, ‘Control’ = 1 and the incrementing operation is required.

Another signal, ‘Ctrl’, is generated to set the LS bit of the input number (to logic 1) when all the input bits are zero. Buffers are used to pass all the other input bits (except the LS bit) unchanged.

Fig. 4. Circuitry for checking zero value at the input bit stream.
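The behaviour described for the zero detector can be modelled bit by bit, as in the sketch below. It assumes that the LS-bit output is formed from the complement of ‘Ctrl’ ORed with x0, which is the form that maps an all-zero input to “0001” as the text describes; the function name is hypothetical.

```python
def zero_detector(x: int):
    """4-bit zero detector model. Returns (y, control)."""
    x3, x2, x1, x0 = (x >> 3) & 1, (x >> 2) & 1, (x >> 1) & 1, x & 1
    ctrl = x3 | x2 | x1               # OR of the three upper bits
    control = ctrl | x0               # 0 only when the whole input is zero
    y0 = (ctrl ^ 1) | x0              # assumption: complemented Ctrl sets the LS bit
    y = (x3 << 3) | (x2 << 2) | (x1 << 1) | y0   # upper bits are simply buffered
    return y, control


print(zero_detector(0b0000))   # all-zero input: LS bit forced to 1, control low
print(zero_detector(0b0110))   # non-zero input passes through, control high
```

Under this assumption every non-zero input is passed through unchanged and only the all-zero input is altered.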

4.2. Comparator

A comparator circuit [28] is required to compare values in the incrementer based design for computing the factorial of a number. The incrementer block increments the value, starting from ‘0’, and the comparator block compares the incremented value with the given number. If the incremented value is less than the given number, it is incremented again and the same comparison stage follows. The iterative process continues until the incremented value is equal to the given number. To compare two numbers we have used two parallel adder stages that check whether one number is equal to, greater than or less than the other. Let the two 4-bit numbers to be compared be A and B, defined as A = a3, a2, a1, a0 and B = b3, b2, b1, b0. We want to compare the value of A with respect to the value of B. (Here B is taken as a fixed number and A is user defined. To simplify the calculation the number is taken as a 4-bit array; wider arrays can be handled in a similar manner.) The first stage adder basically performs a subtraction operation: A and B are the inputs of the parallel adder and the carry bit is set high. After the first stage of addition, if the resultant carry bit (first stage) is high then B > A or B = A; if B < A the resultant carry bit is low. Now consider the case B > A or B < A, which gives a non-zero second stage ‘XOR’ output, so the second stage resultant carry is high. If A = B, the ‘XOR’ output is equal to zero and the second stage resultant carry bit is low. Finally, the first stage and second stage resultant carry bits pass through an ‘AND’ operation, producing the control signal. The hardware implementation of the comparator is shown in Fig. 5.
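The two-stage comparison can be modelled as below. Two details are assumptions, since the text does not spell them out: the first stage is taken to add the complement of A to B with the carry-in high (so that it computes B − A), and the second stage is taken to add the XOR word to “1111” so that a carry appears exactly when the XOR word is non-zero.

```python
MASK = 0xF  # 4-bit operands


def comparator(a: int, b: int) -> int:
    """Return the control bit: 1 while A is still less than B, else 0."""
    # Stage 1 (assumed wiring): ~A + B + 1 = B - A; carry-out is 1 when B >= A.
    stage1 = ((~a) & MASK) + (b & MASK) + 1
    carry1 = (stage1 >> 4) & 1
    # Stage 2 (assumed wiring): XOR word plus 1111; carry-out is 1 when A != B.
    xor_word = (a ^ b) & MASK
    carry2 = ((xor_word + MASK) >> 4) & 1
    # AND of the two carries: high only when B > A.
    return carry1 & carry2


print(comparator(3, 9), comparator(9, 9), comparator(12, 9))
```

Under these assumptions the control bit stays high while the incremented value A is below the given number B and drops as soon as the two are equal, which is when the iteration should stop.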

4.3. Incrementer/decrementer

In this section, a multiplexer based incrementer [17,18] has been adopted for the high speed factorial circuitry. The mathematical description of the reported design is given below.

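$$Y = X + 1 \tag{29}$$

$$Y_j = \begin{cases} X_j \oplus \overline{\left(\overline{X_0} + \overline{X_1} + \cdots + \overline{X_{j-1}}\right)} & 1 \leq j \leq n-1 \\ \overline{X_0} & j = 0 \end{cases} \tag{30}$$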

Circuit level implementation has been carried out using CMOS transmission gates (TG) to make the circuit faster. An n-bit MUX-based incrementer is designed as shown in Fig. 6. It is composed of a data-out MUX array and a selection module (SM) used to find the first one bit. The output of SM is $D_0, D_1, \ldots, D_{n-1}$. It can be noted that each bit of the result Y can be derived by a MUX operation.
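A bit-level sketch of an incrementer satisfying Y = X + 1 is given below. It assumes the usual rule that bit j toggles when all lower bits are ‘1’ and that bit 0 is always inverted; the per-bit selection mimics a 2:1 MUX choosing between X_j and its complement. The names are hypothetical and the selection module is reduced to a running AND.

```python
def mux_incrementer(x: int, n_bits: int = 4) -> int:
    """Increment an n_bits-wide value bit by bit (wraps around on overflow)."""
    y = 0
    select = 1                        # for j = 0 the complement is always selected
    for j in range(n_bits):
        xj = (x >> j) & 1
        # 2:1 MUX: pick the complement of xj when select is 1, else xj itself.
        yj = (xj ^ 1) if select else xj
        y |= yj << j
        select &= xj                  # stays 1 only while all lower bits are 1
    return y


print([mux_incrementer(v) for v in (0b0000, 0b0011, 0b0111, 0b1111)])
```

Because each output bit depends only on X_j and one select line, no carry has to ripple through an adder chain, which is the motivation given for the MUX-based design.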

4.4. Adder/subtractor

The conventional adder/subtractor block has been implemented [27] to perform addition as well as subtraction in a single block, and its performance parameters have been checked using standard 90 nm CMOS technology. Here the control (addsub) signal selects between addition and subtraction: for addition the addsub signal is low, and for subtraction it is high. The circuit level diagram of the reported design is shown in Fig. 7.
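The behaviour of a combined adder/subtractor controlled by a single addsub line can be sketched as follows. The XOR-with-control and carry-in arrangement used here is the textbook two's complement scheme and is an assumption about the block's internals; only the polarity of addsub (low for addition, high for subtraction) is taken from the text.

```python
def add_sub(a: int, b: int, addsub: int, n_bits: int = 4):
    """Return (result, carry_out) for a + b (addsub = 0) or a - b (addsub = 1)."""
    mask = (1 << n_bits) - 1
    # XOR every bit of b with the control line: b when adding, ~b when subtracting.
    b_eff = (b ^ (mask if addsub else 0)) & mask
    # The control line also drives the carry-in, completing the two's complement.
    total = (a & mask) + b_eff + (1 if addsub else 0)
    return total & mask, (total >> n_bits) & 1


print(add_sub(9, 5, 0))   # addition
print(add_sub(9, 5, 1))   # subtraction
```

In subtraction mode a carry-out of 1 indicates that no borrow occurred, i.e. a ≥ b.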

Fig. 5. Hardware implementation of comparator.

Fig. 6. Hardware implementation of multiplexer based incrementer.

Fig. 7. Hardware implementation of Adder/Subtractor.

Table 1
Combination of shifting operation.

“S2S1S0”   Output
“000”      A7 A6 A5 A4 A3 A2 A1 A0
“001”      A6 A5 A4 A3 A2 A1 A0 0
“010”      A5 A4 A3 A2 A1 A0 0  0
“011”      A4 A3 A2 A1 A0 0  0  0
“100”      A3 A2 A1 A0 0  0  0  0
“101”      A2 A1 A0 0  0  0  0  0
“110”      A1 A0 0  0  0  0  0  0
“111”      A0 0  0  0  0  0  0  0


4.5. Logical shifter (LS)

The logical shifter is a barrel shifter, which can shift a number by more than one position at a time, as given by the select inputs. The shifting operation executed by the barrel shifter is shown in Table 1. Here we have designed the left shift operation. Fig. 8 shows the architecture of the 8-bit logical shifter. The LS consists of several multiplexers. The number of multiplexers required is determined by the number of outputs. In general, for an n-bit input, the number of select lines needed is log2 n. So for an eight bit input, the number of select inputs is log2 8 = 3. “00000001” is initially loaded to the input of the multiplexers. For example, if the select inputs “S2 S1 S0” = “111”, the shifted output is “10000000”.
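The left-shift behaviour of Table 1 can be modelled with one multiplexer stage per select line, as sketched below. The staged structure (shift by 1, 2 and 4 positions) is a common barrel shifter arrangement and is assumed here for illustration; the function name is hypothetical.

```python
def barrel_shift_left(a: int, select: int, n_bits: int = 8) -> int:
    """Logical left shift of an n_bits-wide word by `select` positions.

    n_bits is assumed to be a power of two, giving log2(n_bits) select lines.
    """
    mask = (1 << n_bits) - 1
    stages = n_bits.bit_length() - 1      # log2(8) = 3 select lines
    out = a & mask
    for s in range(stages):
        if (select >> s) & 1:             # stage s shifts by 2**s when S_s = 1
            out = (out << (1 << s)) & mask
    return out


print(format(barrel_shift_left(0b00000001, 0b111), "08b"))
print(format(barrel_shift_left(0b10110011, 0b010), "08b"))
```

With “00000001” loaded at the input, the output is simply 2 raised to the selected shift amount, which is how the shifter supplies the 2^k factor needed by the multiplier.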

4.6. Multiplier using ‘UT’ sutra

Eq. (23) shows that the coefficients of the product can be obtained by the convolution sum of the two finite bit sequences. Following the long-hand form of Eq. (22), the 4-bit multiplier algorithm is shown in Fig. 9. The hardware can be implemented using the standard array multiplication technique [26]. For the sake of simplicity a 4-bit multiplier is considered; higher order multipliers can be realized in a similar manner. The partial products are added in two stages. Adders and 4 to 3 compressors are used to minimize the number of stages. Compressors and adders are used carefully so that a minimum number of outputs is generated. Thus the partial products are added using a minimum number of adders/compressors, without compromising the number of bits generated for the next stage operation.

4.7. Radix extraction unit (REU)

Generally, the binary radix can be defined as $\text{Radix} = 2^{Ex} = \sum_{i=0}^{n-1} r_i 2^{i}$, where the term ‘Ex’ is the exponent, which can be expressed as $Ex = \sum_{i=0}^{\log_2 n - 1} ex_i 2^{i}$. The architecture of the radix extraction unit is shown in Fig. 10. The output of the priority encoder (PE) is the radix, which is in turn fed to the binary encoder, which ultimately generates the exponent [29]. The function of the REU is illustrated in the following example. Example: if the binary input is “1110” ($14_{10}$), the PE generates “00010000” ($16_{10}$), which is eight bits wide. The corresponding encoder output, i.e. the exponent, is “100” ($4_{10}$).

Fig. 8. Hardware implementation of Logical Shifter.

Fig. 9. Multiplication Procedure using “UT” sutra.

Fig. 10. Architecture for Radix Extraction Unit.

Fig. 11. Block diagram of complete Vedic multiplier.
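The REU example can be reproduced with the short sketch below. It assumes that the radix is the power of two nearest to the input, with ties resolved towards the larger power; the rounding rule and the function name are assumptions, since the text gives only the single example 14 → 16.

```python
def radix_extraction(x: int):
    """Return (radix, exponent) with radix = 2**exponent nearest to x."""
    if x <= 0:
        raise ValueError("input must be a positive integer")
    low_exp = x.bit_length() - 1      # position of the leading one
    low = 1 << low_exp                # power of two at or below x
    high = low << 1                   # next power of two above x
    if x - low < high - x:            # assumption: ties go to the larger radix
        return low, low_exp
    return high, low_exp + 1


radix, exponent = radix_extraction(0b1110)
print(format(radix, "08b"), format(exponent, "03b"))
```

For the input “1110” this yields the radix “00010000” and the exponent “100”, matching the example in the text.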

4.8. Complete design of Vedic multiplier

The hardware implementation of the NND sutra is shown in Fig. 11. The architecture can be decomposed into four main subsections: (i) Radix Extraction Unit (REU), (ii) Adder/Subtractor, (iii) Logical Shifter and (iv) Multiplier using “UT” sutra. The REU is required to select the proper radix and its exponent value corresponding to the input numbers. The radix is chosen as the one nearest to the given number, which results in an easier multiplication. The subtractor blocks are required to extract the residual parts, i.e., z1 and z2. The product of z1 and z2 is easily determined by the “UT” sutra. The first adder/subtractor block is used to calculate the value of $(X \pm z_2)$. The output of the adder/subtractor is simply logically left shifted by k positions to produce the value of $2^{k}(X \pm z_2)$. The final result is obtained by adding or subtracting the shifter output and the output of the “UT” sutra multiplier.

4.9. Complete factorial design circuitry

The factorial of a number is computed in a parallel manner, leading to high speed operation. In the flowchart diagram (Fig. 2), all the input registers are initially filled with “0001”. At the starting point, the circuit first checks whether the input number is zero. If the input is zero, all the register values are set to ‘1’ and multiplication starts. If the input value is not equal to zero, the incrementer starts incrementing the values up to the number, and the register values are updated with the incremented values for multiplication. A parallel implementation methodology has been adopted for the computation; as a result, the factorial of a 4-bit number can be calculated within 4 stages, which significantly reduces the propagation delay. The block diagram for the factorial calculation is shown in Fig. 12.

Fig. 12. Block diagram for hardware implementation of factorial calculation.

Table 2
Performance parameters: propagation delay (ps), average dynamic power consumption (μW) and energy delay product (10^-27 J·s) of different components such as zero detector, incrementer/decrementer, comparator, adder/subtractor and REU.

Circuit module             Delay (ps)   Power (μW)   EDP (10^-27 J·s)
Zero detector              120          1.02         14.16
Incrementer/decrementer    180          3.14         101.74
Comparator                 148          2.15         47.09
Adder/subtractor (4-bit)   140          0.856        16.3
REU                        376          0.678        95.85

Fig. 13. Comparison of results of different types of Vedic multipliers (VM), implemented in the same environment, in terms of performance parameters such as propagation delay (ns) and dynamic switching power (μW), as a function of the number of input bits.
Panels: propagation delay (ns) and power (μW) versus multiplier size (4×4, 8×8, 16×16, 32×32) for [20], [26] and the proposed design.

5. Results and discussions

Transistor level simulation of the factorial calculation circuit was performed with the Spice Specter simulator using 90 nm CMOS technology with a 1 V power supply, operated at 10 MHz. A dual threshold voltage (VT) operating mode was considered in simulation to determine the performance parameters. In designing the factorial calculation circuits for 3-bit, 4-bit and 5-bit numbers, all the individual modules, such as the zero detector, incrementer/decrementer and Vedic multiplier, were implemented using TG to make the circuit faster. Lowering the supply voltage reduces the power dissipation quadratically, which makes it attractive. Although a low supply voltage affects delay, this is compensated by the lower RC delay of the TG circuit and the dual threshold CMOS technology. It should also be noted that each TG circuit requires fewer transistors than the conventional CMOS implementation, which reduces the layout area. The individual performance parameters, such as propagation delay, dynamic switching power consumption and Energy Delay Product (EDP), for the different circuit modules, i.e. zero detector, incrementer/decrementer and comparator, are shown in Table 2. Our main focus was on reducing the propagation delay, the average dynamic switching power consumption and the energy delay product.

It is worth mentioning that we took the implementation methodologies from the references [20,26], implemented them in the same technological environment (Spice Specter with standard 90 nm CMOS technology), and then compared the performance parameters. Fig. 13 shows the comparison of the different multipliers designed in this common environment. For comparison purposes, the ideas were taken from the references and simulated, and the performance parameters were computed using the same MOSFET technology file. Input data were taken in a regular fashion for experimental purposes. The delay and the power were measured using the worst-case pattern and at the output where the delay is maximum. From Fig. 13 it is observed that, for higher bit length multiplication, the Vedic multiplier offered a substantial reduction in propagation delay and dynamic switching power consumption.

Fig. 14 shows the performance parameters, namely propagation delay and dynamic switching power consumption, for factorial computation with different input bit lengths (3-bit, 4-bit and 5-bit numbers). Fig. 14 also compares the same circuitry implemented with different architectures, namely 'UT' and 'NND'. From Fig. 14, it is observed that the proposed design offered ~33% and ~24% improvement in propagation delay, with a corresponding reduction in power consumption of ~34.48% and ~24%, for the factorial calculation circuitry in comparison with UT and NND based implementations, respectively. Fig. 15 represents the layout of the proposed factorial circuitry for a 4-bit number using parallel Vedic multiplication methodology, with a layout area of only ~6 mm². It can be envisaged from the above discussion that the Vedic multiplier is the most critical element in improving the speed of the circuit that computes the factorial of a number.

[Fig. 14 plots: propagation delay (ns) and power consumption (mW) versus input bit length (3-bit, 4-bit, 5-bit) for [20], [26] and the proposed design.]

Fig. 14. Comparison of results for computation of the factorial of a number using parallel Vedic implementation methodology, in terms of propagation delay (ns) and dynamic switching power (mW), as a function of input bit length.

Fig. 15. Layout of the factorial design circuit using parallel Vedic multiplication methodology. The layout occupies only ~6 mm² of area and was implemented using L-Edit V-13 of the T-Spice simulator.


6. Conclusion

In this paper, based on ancient Vedic mathematics, we report a novel circuit for computing the factorial of a 4-bit number through parallel implementation methodology. This novel architecture combines the advantages of the ancient Vedic formulae and parallel implementation techniques, leading to a significant reduction in the number of stages and hence to high speed operation. In the circuit realization, an (N×N) bit multiplier implementation was transformed into just one small multiplication (bit length 5N) and one adder/subtractor implementation, giving high speed operation for factorial computation. The propagation delay for the 4-bit factorial calculation was only ~42.13 ns, while the power consumption was ~58.82 mW, for a layout area of ~6 mm². The improvement in speed was found to be ~33% and ~24%, with a corresponding reduction in power consumption of ~34.48% and ~24%, for the factorial calculation circuitry in comparison with UT and NND based implementations, respectively. It can be envisaged that the speed improvement in the factorial computation circuit is attributable largely to the incorporation of the Vedic multiplier.
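A behavioural sketch may help to make the data flow concrete. It assumes that the "one small multiplication and one adder/subtractor" decomposition can be read as the complement identity a·b = B·(a + b − B) + (B − a)·(B − b) with B = 2^n, and that the factorial is accumulated by repeated multiplication with a decrementing counter until a zero detector fires. All function names are hypothetical, the choice of working width is an assumption, and the code models arithmetic only, not the reported circuit or its timing.

```python
def complement_multiply(a, b, n_bits):
    """Compute a*b as B*(a + b - B) + (B - a)*(B - b), with B = 2**n_bits."""
    base = 1 << n_bits
    small_product = (base - a) * (base - b)   # the single multiplication
    left_part = a + b - base                  # the adder/subtractor step
    # Multiplying by the power-of-two base corresponds to a shift in hardware.
    return left_part * base + small_product


def is_zero(value):
    """Zero detector: true when the counter has run out."""
    return value == 0


def decrement(value):
    """Decrementer: step the counter down by one."""
    return value - 1


def factorial_datapath(n, n_bits=4):
    """Accumulate n! by iterative multiplication with a decrementing counter."""
    if not 0 <= n < (1 << n_bits):
        raise ValueError("input does not fit in n_bits")
    product = 1
    counter = n
    while not is_zero(counter):
        # Working width chosen to cover both operands (an assumption).
        width = max(product.bit_length(), counter.bit_length())
        product = complement_multiply(product, counter, width)
        counter = decrement(counter)
    return product
```

For example, `complement_multiply(13, 14, 4)` forms the complements 3 and 2, multiplies them to get 6, adds 13 + 14 − 16 = 11 shifted by four bits, and returns 182. The identity holds for any base, but the complement product is only small when both operands lie close to the base.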

References

[1] J.P. Deschamps, G.J.A. Bioul, G.D. Sutter, Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems, Wiley Interscience Publications, 2006, pp. 180–198.

[2] J.F. Hart, E.W. Cheney, C.L. Lawson, H.J. Maehly, C.K. Mesztenyi, J.R. Rice, H.G. Thacher Jr., C. Witzgall, Computer Approximations, Wiley, 1968.

[3] M. M.-Dastjerdi, A. A.-Kusha, M. Pedram, BZ-FAD: a low-power low-area multiplier based on shift-and-add architecture, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 17 (2) (2009) 302–306.

[4] A.D. Booth, A signed binary multiplication technique, Q. J. Mech. Appl. Math. IV (1952) 236–240.

[5] Y.-H. Seo, D.-W. Kim, A new VLSI architecture of parallel multiplier–accumulator based on radix-2 modified Booth algorithm, IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 18 (2) (2010) 201–208.

[6] J. Hu, L. Wang, T. Xu, A low-power adiabatic multiplier based on modified Booth algorithm, in: Proceedings of the IEEE International Symposium on Integrated Circuits, Singapore, September 2007, pp. 489–492.

[7] C.S. Wallace, A suggestion for a fast multiplier, IEEE Trans. Electron. Comput. EC-13 (1) (1964) 14–17.

[8] M. Young, The Technical Writer's Handbook, University Science, Mill Valley, CA, 1989.

[9] F. Carbognani, F. Buergin, N. Felber, H. Kaeslin, W. Fichtner, A 2.7-μW/MHz transmission-gate-based 16-bit multiplier for digital hearing aids, in: Proceedings of the IEEE 48th Midwest Symposium on Circuits and Systems, Covington, KY, August 2005, pp. 1406–1409.

[10] Z. Wang, G.A. Jullien, W.C. Miller, A new design technique for column compression multipliers, IEEE Trans. Comput. 44 (8) (1995) 962–970.

[11] K.-J. Cho, S. Jo, Y.-E. Kim, Y.-N. Xu, J.-G. Chung, Constant multiplier design using specialized bit pattern adders, in: Proceedings of the IEEE Fifteenth International Conference on Electronics, Circuits and Systems, St. Julien's, August 2008, pp. 41–44.

[12] S.L. Chen, X.-Y. Tian, X.-J. Zhao, Improved multiplier of CSD used in digital signal processing, in: Proceedings of the IEEE International Conference on Machine Learning and Cybernetics, Kunming, July 2008, pp. 2905–2908.

[13] A. Avizienis, Signed-digit number representations for fast parallel arithmetic, IRE Trans. Electron. Comput. EC-10 (1961) 389–400.

[14] M.R. Stan, A.F. Tenca, M.D. Ercegovac, Long and fast up/down counters, IEEE Trans. Comput. 47 (7) (1998) 722–735.

[15] D.R. Lutz, D.N. Jayasimha, Programmable modulo-K counters, IEEE Trans. Circuits Syst.: Fund. Theory Appl. 43 (11) (1996) 939–941.

[16] R. Hashemian, Highly parallel increment/decrement using CMOS technology, in: Proceedings of the 33rd IEEE Midwest Symposium on Circuits and Systems, Calgary, Alberta, Canada, August 1990, vol. 2, pp. 866–869.

[17] C.-H. Huang, J.-S. Wang, Y.-C. Huang, A high-speed CMOS incrementer/decrementer, in: Proceedings of the IEEE International Symposium on Circuits and Systems, Sydney, Australia, May 2001, vol. 4, pp. 88–91.

[18] S. Bi, W.J. Gross, W. Wang, A. Al-Khalili, M.N.S. Swamy, An area-reduced scheme for modulo 2^n−1 addition/subtraction, in: Proceedings of the IEEE Ninth International Database Engineering and Application Symposium, July 2005, pp. 396–399.

[19] J.S.S.B.K.T. Maharaja, Vedic Mathematics, Motilal Banarsidass Publishers Pvt Ltd, Delhi, 2001.

[20] P. Mehta, D. Gawali, Conventional versus Vedic mathematical method for hardware implementation of a multiplier, in: Proceedings of the IEEE International Conference on Advances in Computing, Control, and Telecommunication Technologies, Trivandrum, Kerala, December 2009, pp. 640–642.

[21] M. Ramalatha, K. Thanushkodi, K.D. Dayalan, P. Dharani, A novel time and energy efficient cubing circuit using Vedic mathematics for finite field arithmetic, in: Proceedings of the IEEE International Conference on Advances in Recent Technologies in Communication and Computing, Kerala, October 2009, pp. 873–875.


[22] M. Ramalatha, K.D. Dayalan, P. Dharani, S.D. Priya, High speed energy efficient ALU design using Vedic multiplication techniques, in: Proceedings of the IEEE International Conference on Advances in Computational Tools for Engineering Applications, Zouk Mosbeh, July 2009, pp. 600–603.

[23] S. Akhter, VHDL implementation of fast N×N multiplier based on vedic mathematic, in: Proceedings of the IEEE Eighteenth European Conference on Circuit Theory and Design, Seville, August 2007, pp. 472–475.

[24] P. Mehta, D. Gawali, Conventional versus Vedic mathematical method for hardware implementation of a multiplier, in: Proceedings of the IEEE International Conference on Advances in Computing, Control, and Telecommunication, Trivandrum, Kerala, December 2009, pp. 640–642.

[25] H.D. Tiwari, G. Gankhuyag, C.M. Kim, Y.B. Cho, Multiplier design based on ancient Indian Vedic mathematics, in: Proceedings of the IEEE International SoC Design Conference, Busan, November 2008, pp. 65–68.

[26] P. Saha, A. Banerjee, P. Bhattacharyya, A. Dandapat, High speed ASIC design of complex multiplier using Vedic mathematics, in: Proceedings of the IEEE Student Technology Symposium, Kharagpur, January 2011, pp. 237–241.

[27] P.K. Saha, A. Banerjee, A. Dandapat, High speed low power complex multiplier design using parallel adders and subtractors, Int. J. Electron. Elect. Eng. (IJEEE) 07 (11) (2009) 38–46.

[28] S. Veeramachaneni, M.K. Krishna, L. Avinash, P.S. Reddy, M.B. Srinivas, Efficient design of 32-bit comparator using carry look-ahead logic, in: Proceedings of the IEEE Northeast Workshop on Circuits and Systems, Montreal, August 2007, pp. 867–870.

[29] R. Hashemian, A high speed compact priority encoder, in: Proceedings of the IEEE 32nd Midwest Symposium on Circuits and Systems, Champaign, August 1989, pp. 197–200.
