
Chapter 1
INTRODUCTION

1.1 Preamble
The increasing demand for high performance, high speed and battery-operated systems in communication and computing has shifted the focus from the traditional constraints (such as area, performance, cost and reliability) to power consumption. The multiplier is used as a computational unit in various systems such as DSPs and microprocessors. A multiplier is said to be efficient if it has high speed, low power and small area. Generally, digital systems use two types of logic: static and dynamic. The main difference between them is that dynamic logic uses a clock. Dynamic logic can be more than twice as fast as static logic; it is harder to design with, but when speed is essential there is often no other choice. Dynamic logic operates in two phases. The first phase is the set-up or precharge phase, during which the output is unconditionally driven high regardless of the input values; the capacitors that form the load capacitance of the gate are charged. During the evaluation phase the clock is high and the output is evaluated from the inputs.

A popular implementation of dynamic logic is domino logic. Domino logic is a CMOS-based evolution of the dynamic logic techniques, which are based on either PMOS or NMOS transistors, and it was developed to speed up circuits. In domino logic, each dynamic gate output drives an inverter. In a cascade structure consisting of several stages, the evaluation of each stage ripples into the evaluation of the next stage, similar to dominoes falling one after the other. Once fallen, the node state cannot return to 1 (until the next clock cycle), just as dominoes, once fallen, cannot stand up; the structure is hence called domino CMOS logic. Domino logic is used in today's high performance microprocessors to implement circuits that are both fast and area efficient. Among its many advantages, domino logic provides reduced input capacitance and low power dissipation. Domino logic circuits are driven by a clock, and the clock distribution network dissipates from 20% to 45% of the overall consumed power; this limits the use of domino logic circuits in low power applications.

The multiplication operation consists of various shift and addition steps, so the multiplier will be faster if the additions are performed by high speed adders. For this purpose I have used the fast carry lookahead, Wallace tree and Kogge stone adders. I have also used the modified Booth algorithm to design the multiplier, since it reduces the number of partial products almost to half, which in turn reduces power dissipation and area. The performance evaluation of a multiplier is necessary in order to know which multiplier design meets a particular system's requirements, because different systems may require multipliers with different parameters: for example, a microprocessor requires a high speed multiplier, whereas another system may require a multiplier with small area, and both cannot usually be obtained in one design. So, in order to know which multiplier design suits a given requirement, the performance of the multipliers is evaluated.

1.2 Historical perspective
In the early CMOS days there was a great need for low power technologies. Because such technologies were not available, the CMOS industry was unable to reduce chip size or area, since reducing chip size means more circuits on a single chip; the power per unit area then increases, which leads to higher power dissipation. There was therefore a great need for circuits with less power dissipation, and this need led to the invention of fast, low power dynamic circuits. In the NMOS days of the 1970s, dynamic logic was used to reduce the power consumption inherent in NMOS logic, as well as area. These dynamic circuits, however, suffered from a contention problem. Footed dynamic circuits were then introduced, which removed the contention problem, but dynamic circuits still had to satisfy the monotonicity requirement. In the early 1980s a variant of dynamic circuits, known as domino logic, was proposed; domino logic was introduced by Robert Krambeck in 1982 [10]. Domino logic circuits produce only a non-inverting output, whereas certain applications require inverting as well as non-inverting operation; this need has led to dual-rail domino logic circuits in recent years.

Since the multiplication operation consists of various shift and add operations, the historical background of the adders used in the multiplier design is also relevant. The ripple carry adder was invented in the early 1950s. It passes the carry from the least significant bit to the most significant bit, which results in a long delay and hence increased power dissipation. A faster adder, the carry lookahead adder, was then invented in 1958 by Weinberger. It was found, however, that the delay of the carry lookahead adder increases with the number of bits to be added, so an even faster adder was needed. In 1973 the Kogge Stone adder was designed. It is based on the same principle as the carry lookahead adder; one can say it is a carry lookahead adder in which the carry is generated in a different manner, which makes it faster. Many new adders have been invented since, but the Kogge stone adder has been found to be faster than the others.

1.3 Thesis Objective
Based upon the above discussion, the thesis has the following objectives: to design and simulate two different types of multiplier, and to evaluate their performance in terms of speed, power dissipation and leakage current.

1.4 Organization of thesis
The 2nd chapter explains the four main modules, that is, the booth encoder, the partial product generator, the Wallace tree adder, and the carry lookahead and Kogge stone adders; in other words, it explains the architecture of the multiplier. The 3rd chapter defines dynamic circuits, studies their advantages and disadvantages, and explains in detail the reason for placing the footed transistor. The 4th chapter explains the concept of leakage, the sources of leakage and a leakage reduction technique. The 5th chapter includes the simulation waveforms, results and observations.

1.5 Methodology
In this thesis, two different types of multiplier are designed using domino dynamic circuits. In order to obtain a high speed multiplier, adders such as the carry lookahead, Kogge stone and Wallace tree are used. Two 4 x 4 modified booth multipliers are designed and simulated using the Design Architect tool of Mentor Graphics, based on 180nm CMOS technology.

Chapter 2
MULTIPLIER ARCHITECTURE

2.1 Architecture
The multiplier consists of four main modules, i.e. the booth encoder, the partial product generator, the Wallace tree adder and the carry lookahead or Kogge stone adder, as shown in Fig 2.1 [11]. The multiplier is based upon the modified booth algorithm, which reduces the power dissipation by reducing the number of partial product rows. The output of the booth encoder acts as input to the partial product generator, which generates the partial product rows. The partial product rows are then added in the Wallace tree adder, which gives sum and carry bits at its output. The sum and carry bits are finally added in the Kogge stone adder, which provides the final product bits.

[Fig 2.1 Block diagram of the multiplier: booth encoder (local clk) -> partial product generator -> Wallace tree adder -> Kogge stone adder (global clk) -> product]

The multiplier uses two clocks, a local clock and a global clock. The local clock is given to the booth encoder, while the global clock is given to the partial product generator, the Wallace tree adder and the final adder, which may be the carry lookahead or Kogge stone adder. The two clocks are used to avoid clock skew.

2.2 Booth encoder
The encoder used in the multiplier is based upon the Booth algorithm, so it is called the Booth encoder. Booth encoding was proposed by Andrew Donald Booth in 1951 [2]. The algorithm consists of various shift and add operations; it becomes inconvenient for designing parallel multipliers when the number of add/subtract operations and the number of shift operations become variable, and the algorithm also becomes inefficient when there are isolated 1s. The Booth algorithm was later modified by O. L. MacSorley in 1961, and this version is called the Modified Booth Algorithm [11]. The booth encoding algorithm is a bit-pair encoding algorithm that generates partial products which are multiples of the multiplicand. The encoding method is widely used to generate the partial products for the implementation of large parallel multipliers, which adopt the parallel encoding scheme.

The basic principle of the modified Booth encoding can be described as follows. Consider the multiplication of two fixed-point two's complement numbers X and Y, where X is the multiplicand and Y is the multiplier, both of n bits. Y can be expressed as

Y = -Y(n-1)·2^(n-1) + Σ_{i=0..n-2} Y(i)·2^i        (2.1)

Y = Σ_{i=0..n/2-1} (-2·Y(2i+1) + Y(2i) + Y(2i-1))·2^(2i) = Σ_{i=0..n/2-1} d(i)·4^i        (2.2)

Using this notation, the multiplication of X and Y is given by

XY = Σ_{i=0..n/2-1} d(i)·4^i·X = Σ_{i=0..n/2-1} P(i)·4^i        (2.3)

where d(i) = -2·Y(2i+1) + Y(2i) + Y(2i-1), with Y(-1) = 0, and P(i) = d(i)·X is the ith partial product.
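
As a quick check of equations (2.1)-(2.3), the radix-4 recoding can be verified numerically. The sketch below is illustrative Python (the helper names are invented for this example): it recodes every 4-bit two's complement multiplier Y into the digits d(i) and confirms that the recoded sums reproduce Y and X·Y.

def booth_digits(y, n=4):
    """Radix-4 (modified Booth) digits d(i) = -2*Y(2i+1) + Y(2i) + Y(2i-1), eq. (2.2)."""
    bits = [(y >> k) & 1 for k in range(n)]    # Y(0) .. Y(n-1)
    bits = [0] + bits                          # prepend Y(-1) = 0
    return [-2 * bits[2*i + 2] + bits[2*i + 1] + bits[2*i]
            for i in range(n // 2)]

def to_signed(v, n=4):
    """Interpret an n-bit pattern as a two's complement value, eq. (2.1)."""
    return v - (1 << n) if v & (1 << (n - 1)) else v

for y in range(16):                                        # all 4-bit patterns
    d = booth_digits(y)
    assert sum(di * 4**i for i, di in enumerate(d)) == to_signed(y)            # eq. (2.2)
    for x in range(-8, 8):
        assert sum(di * 4**i * x for i, di in enumerate(d)) == to_signed(y) * x  # eq. (2.3)
print("radix-4 recoding identity holds for all 4-bit Y")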

The booth algorithm shifts and/or complements the multiplicand (X operand) based on the bit patterns of the multiplier (Y operand). Essentially, three multiplier bits [Y(i+1), Y(i) and Y(i-1)] are encoded to select one of the multiples of the multiplicand {-2X, -X, 0, +X, +2X}. The three multiplier bits consist of a new bit pair [Y(i+1), Y(i)] and the leftmost bit of the previously encoded bit pair [Y(i-1)], as shown in Table 2.1. Obviously, from the above equations, the partial product P(i+1) should be shifted two positions to the left of the partial product P(i), since P(i) is weighted by 4^i = 2^(2i). For an n x n-bit multiplication, the booth algorithm produces n/2 partial products; that is, the algorithm reduces the number of partial products almost to half. It reduces the number of adders by 50%, which results in a higher speed, a lower power dissipation and a smaller area than a conventional multiplication array.

Table 2.1 Booth Encoding

Y(i+1)  Y(i)  Y(i-1) | neg  one  two | Operation on multiplicand (X)
  0      0      0    |  0    0    0  |   0
  0      0      1    |  0    1    0  |  +X
  0      1      0    |  0    1    0  |  +X
  0      1      1    |  0    0    1  |  +2X
  1      0      0    |  1    0    1  |  -2X
  1      0      1    |  1    1    0  |  -X
  1      1      0    |  1    1    0  |  -X
  1      1      1    |  0    0    0  |   0

The problems faced in the Booth algorithm can be overcome by using the Modified Booth algorithm. The algorithm works as follows: 1) In unsigned multiplication, two zeros are appended to the left of the MSB and one zero is appended to the right of the LSB of the multiplier bits. 2) In signed multiplication, a 0 is appended to the right of the LSB of the multiplier bits. 3) The resulting bits are divided into triplets; in the unsigned case there are 3 triplets and therefore three partial product rows, while in the signed case there are only two triplets and therefore two partial product rows. These steps will become clearer in the next section, on sign extension.

[Fig 2.2 Booth encoding and partial product generation: the multiplier bits, with the appended zeros, are grouped into overlapping triplets, producing the partial product rows P03 P02 P01 P00, P13 P12 P11 P10 and P23 P22 P21 P20]

As explained above, we divide the multiplier bits into overlapping triplets. These triplets are then given as input to the Booth encoder blocks; the number of encoder blocks is equal to the number of triplets, so in the case of a 4 x 4 multiplication there are three booth encoders. Each booth encoder has a three-bit output, named neg, one and two. Since the Modified Booth algorithm reduces the number of partial products, it also results in a reduction of the number of adders. The gate level diagram of the booth encoder, showing how the three outputs are derived from the three multiplier bits, is given below.

[Fig 2.3 Gate level diagram of the booth encoder: neg, one and two generated from Y(i+1), Y(i), Y(i-1) and their complements]
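
The encoder behaviour of Table 2.1 and Fig 2.3 can be captured in a few lines of Python. This is a behavioural sketch only (the function name is invented; the gate-level cell in the thesis may realize the same truth table differently), checked against the signed-digit value of each triplet.

def booth_encode(y_ip1, y_i, y_im1):
    """Return the (neg, one, two) control bits for one overlapping triplet (Table 2.1)."""
    one = y_i ^ y_im1                              # magnitude 1: select +/-X
    two = (y_ip1 ^ y_i) & (1 - (y_i ^ y_im1))      # magnitude 2: select +/-2X
    neg = y_ip1 & (1 - (y_i & y_im1))              # subtract instead of add
    return neg, one, two

# Exhaustive check against the signed digit d = -2*Y(i+1) + Y(i) + Y(i-1)
for t in range(8):
    y_ip1, y_i, y_im1 = (t >> 2) & 1, (t >> 1) & 1, t & 1
    neg, one, two = booth_encode(y_ip1, y_i, y_im1)
    d = -2 * y_ip1 + y_i + y_im1
    mag = one + 2 * two
    assert abs(d) == mag and (d < 0) == (neg == 1 and mag != 0)
    print(y_ip1, y_i, y_im1, "-> neg=%d one=%d two=%d" % (neg, one, two))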

2.3 Partial product generator
The output from the Booth encoder is used in this module to generate the partial products. Since there are three Booth encoders, there are a total of three partial product rows in the case of a 4 x 4 multiplication. The simplest partial product generator produces N partial product rows, where N is the number of bits to be multiplied. Since the amount of hardware and the delay depend upon the number of partial products to be added, reducing this number lowers the hardware cost and improves performance. The architecture of the partial product generator is shown in figure 2.4.

[Fig 2.4 Architecture of the partial product generator: inputs two, one, neg, X(i) and X(i-1); output PPG]

The neg, one and two signals are the three output bits of the booth encoder, and X(i) and X(i-1) are the multiplicand bits. These five inputs are given to the partial product generator and result in the generation of one partial product bit. X(i) is the current bit and X(i-1) is the previous bit of the multiplicand; PPG is the output of the partial product generator. In the case of a 4 x 4 bit multiplication there are twelve partial product bits, so twelve partial product generators have to be used. The booth encoding scheme has nevertheless proved very useful, as it reduces the number of partial products to almost half.
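
A behavioural sketch of one partial product generator cell is given below, assuming the usual select-and-conditionally-invert behaviour implied by Fig 2.4 (the exact gate-level cell in the thesis may differ); one selects X(i), two selects X(i-1) (the 2X shift), and neg inverts the selected bit. The extra +1 needed to complete a two's complement negation is the neg bit added at the LSB of the row, as described in section 2.4.1.

def ppg_bit(x_i, x_im1, neg, one, two):
    """One partial product bit: select x_i (one) or x_im1 (two), then invert if neg."""
    selected = (one & x_i) | (two & x_im1)
    return selected ^ neg              # conditional inversion for negative multiples

def ppg_row(x_bits, neg, one, two, width):
    """Build one partial product row for multiplicand bits x_bits (LSB first)."""
    row = []
    for i in range(width):
        x_i   = x_bits[i] if i < len(x_bits) else 0
        x_im1 = x_bits[i - 1] if i - 1 >= 0 else 0
        # bits above the multiplicand MSB become equal to neg, i.e. the 0- or 1-extension
        # described in section 2.4.1; the '+1' (the neg bit) is added later in the adder array
        row.append(ppg_bit(x_i, x_im1, neg, one, two))
    return row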

2.4 Sign extension in Booth multipliers
While multiplying two numbers we first have to see whether they are signed or unsigned [11]. Signed numbers can be negative (and are represented in two's complement), whereas unsigned numbers are always non-negative. The two types are handled differently before the partial products are added to obtain the final product.

2.4.1 Sign extension for unsigned multiplication
Unsigned numbers are multiplied normally, but we still have to see whether the resulting partial products are positive or negative, since the two cases are handled differently. A partial product is positive if neg is low (its value is zero), and negative if neg is high (its value is one). This is made clearer by figure 2.5.

[Fig 2.5 Sign extension in the booth multiplier with positive partial products: rows P03 P02 P01 P00 (with neg0), P13 P12 P11 P10 (with neg1) and P23 P22 P21 P20, each extended with zeros]

The figure above shows the case of an unsigned 4 x 4 bit multiplication in which the partial products are positive. In this case neg0 and neg1 are zero, and each partial product except the last one is appended with zeros extended up to the last bit of the last partial product. Taking another case where the partial products are negative, neg is taken as one and all the partial product bits have to be complemented, as made clearer by figure 2.6.

[Fig 2.6 Sign extension in the booth multiplier with negative partial products: the rows are complemented and extended with ones, with neg0 = neg1 = 1]

As is clear from the figure, there are two partial product rows which are negative. One may ask why the last partial product row is not complemented: since two zeros are appended in the most significant positions, it can be seen from Table 2.1 that multiplier bits with two zeros in the most significant positions always produce a positive partial product row. In this case neg0 and neg1 are equal to one, and each partial product except the last one is appended with ones extended up to the last bit of the last partial product. Also, each partial product is shifted two bits to the left relative to the partial product above it, to account for the modified Booth encoding of the multiplier. If an unsigned multiplication has both negative and positive partial products, the same procedure as discussed in the two cases above is applied: a negative partial product is complemented and a neg equal to one is added to it, whereas a positive partial product is not complemented and a neg equal to zero is added to it.

2.4.2 Sign extension for signed multiplication
Signed numbers may be negative, and such numbers are 2's complemented before they are multiplied. The following modifications are necessary for 2's complement, signed multiplication.

[Fig 2.7 Sign extension in the booth multiplier with signed numbers: rows P03 P02 P01 P00 (with neg0) and P13 P12 P11 P10 (with neg1), each extended with the bit E]

The most significant partial product, which in the unsigned case is necessary to guarantee a positive result, is not needed for signed multiplication; that is, we do not need to append the two zeros to the left of the most significant bits of the multiplier. So in this case there are only two partial product rows for a 4 x 4 bit multiplication. As above, in this case too we can have positive as well as negative partial products. For the sign extension of signed numbers, as shown above, the partial product rows do not need to be complemented; instead a variable E is used, whose value is determined as follows.
Neg is 1 if the partial product is negative.
Neg is 0 if the partial product is positive.
E is 1 if the multiplicand and the partial product are both negative or both positive.
E is 0 if the multiplicand is negative and the partial product is positive, or if the multiplicand is positive and the partial product is negative.
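
The unsigned rules of section 2.4.1 (complement a negative row, extend it with the neg value, add neg at the row's LSB, and shift each row two places left) can be checked with a short Python model. This is an illustrative behavioural check with invented helper names, and it folds the neg bit directly into the complemented row, which is arithmetically equivalent to adding it in the LSB column.

def unsigned_booth_rows(x, y, n=4):
    """Partial product rows for an unsigned n x n multiply, per section 2.4.1."""
    width = 2 * n + 2                                        # columns incl. sign extension
    mask = (1 << width) - 1
    ybits = [0] + [(y >> k) & 1 for k in range(n)] + [0, 0]  # Y(-1)=0, two zeros at the MSB side
    rows = []
    for i in range((n + 2) // 2):                            # three triplets for n = 4
        d = -2 * ybits[2*i + 2] + ybits[2*i + 1] + ybits[2*i]
        mag = abs(d) * x                                     # 0, X or 2X
        if d >= 0:
            row = mag                                        # positive row: extend with 0s (Fig 2.5)
        else:
            row = (~mag & mask) + 1                          # negative row: complement, add neg=1 (Fig 2.6)
        rows.append((row << (2 * i)) & mask)                 # each row shifted two places left
    return rows, mask

for x in range(16):
    for y in range(16):
        rows, mask = unsigned_booth_rows(x, y)
        assert (sum(rows) & mask) == x * y
print("unsigned Booth partial products sum to X*Y for all 4-bit operands")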

2.5 Wallace tree Adder
A Wallace tree is an implementation of an adder tree designed for minimum propagation delay [9]. Rather than completely adding the partial products in pairs as a ripple adder tree does, the Wallace tree sums up all the bits of the same weight in a merged tree. Usually full adders are used, so that three equally weighted bits are combined to produce two bits: one (the carry) with weight n+1 and the other (the sum) with weight n. Each layer of the tree therefore reduces the number of vectors by a factor of 3:2. (Another popular scheme obtains a 4:2 reduction using a different adder style that adds little delay in an ASIC implementation, as shown in figure 2.10.1.) The tree has as many layers as necessary to reduce the number of vectors to two (a carry and a sum); a conventional adder is then used to combine these and obtain the final product. Probably the single most important advance in improving the speed of multipliers, pioneered by Wallace, is the use of carry save adders (CSAs, also known as full adders or 3:2 counters) to add three or more numbers in a redundant, carry-propagate-free manner. The method is illustrated in figure 2.8.

[Fig 2.8 4 x 4 Wallace tree adder: the partial product bits P00-P33, plus the 0/1 sign-extension bit, are reduced column by column with half adders (HA) and full adders (FA) to the final sum bits A0-A7 and carries C1-C5]

However, in addition to the large number of adders required, the Wallace tree's wiring is much less regular and more complicated. As a result, Wallace trees are often avoided by designers when design complexity is a concern. Wallace tree styles use a log-depth tree network for reduction; they are faster but irregular, trading ease of layout for speed. Wallace tree styles are generally avoided for low power applications, since the excess wiring is likely to consume extra power. While substantially faster than a simple carry-save structure for large bit multipliers, the Wallace tree multiplier has the disadvantage of being very irregular, which complicates the task of coming up with an efficient layout. The Wallace tree multiplier is nevertheless a high speed multiplier, and the summing of the partial product bits in parallel using a tree of carry-save adders became generally known as the Wallace tree. A three step process is used to multiply two numbers: formation of the bit products; reduction of the bit product matrix into a two row matrix by means of carry save adders; and summation of the remaining two rows using a faster carry lookahead adder or another adder such as the Kogge stone adder. A sketch of this reduction process is given below.
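
The following is a minimal Python sketch of the three steps (function names invented for this example). Full adders are written in the usual majority form, which is equivalent to equation (2.5), and ordinary integer addition stands in for the final CLA/Kogge-Stone stage.

def wallace_reduce(columns):
    """columns[w] holds the bits of weight 2**w; reduce until every column has <= 2 bits."""
    while any(len(col) > 2 for col in columns):
        new_cols = [[] for _ in range(len(columns) + 1)]
        for w, col in enumerate(columns):
            while len(col) >= 3:                       # full adder (3:2 compressor)
                a, b, c = col.pop(), col.pop(), col.pop()
                new_cols[w].append(a ^ b ^ c)
                new_cols[w + 1].append((a & b) | (b & c) | (a & c))
            if len(col) == 2:                          # half adder
                a, b = col.pop(), col.pop()
                new_cols[w].append(a ^ b)
                new_cols[w + 1].append(a & b)
            new_cols[w].extend(col)                    # 0 or 1 leftover bit passes through
        columns = new_cols
    return columns

def bits_to_int(columns):
    return sum(bit << w for w, col in enumerate(columns) for bit in col)

# 4x4 unsigned example: bit products a_i & b_j are placed in column i + j
a, b = 11, 13
cols = [[] for _ in range(8)]
for i in range(4):
    for j in range(4):
        cols[i + j].append(((a >> i) & 1) & ((b >> j) & 1))
reduced = wallace_reduce(cols)
assert all(len(c) <= 2 for c in reduced)
assert bits_to_int(reduced) == a * b       # final carry-propagate add stands in for the CLA/KSA
print(a, "*", b, "=", bits_to_int(reduced))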

[Fig 2.9 3:2 Compressor: inputs a, b, c; outputs Sum and Carry]

3:2 COMPRESSOR
As shown in the figure, a 3:2 compressor takes three inputs and produces two outputs. The equations for the sum and carry are shown below.

Sum = a ⊕ b ⊕ c        (2.4)
Carry = (a ⊕ b)·c + (a ⊕ b)′·a        (2.5)

Apart from the 3:2 compressor, the 4:2 compressor has become a topic of significant research in the arithmetic community [13]. The 4:2 compressor has transformed the standard frame of mind of counter based partial product reduction schemes by introducing the notion of horizontal data paths within stages of reduction.


[Fig 2.10.1 4:2 Compressor: inputs a, b, c, d and Cin; outputs sum, carry and Cout]

As is clear from the figure, the 4:2 compressor takes four inputs (plus a carry-in Cin) and produces two outputs (plus a carry-out Cout). A 4:2 compressor can be realized from two 3:2 compressors, as shown in the figure below.


[Fig. 2.10.2 4:2 Compressor realization using 3:2 compressors: the first 3:2 compressor takes a, b, c and produces Cout and an intermediate sum; the second 3:2 compressor takes that sum, d and Cin and produces Sum and Carry]

The Cin is initially taken as zero. The sum, carry and Cout for the 4:2 compressor are given by

Sum = a ⊕ b ⊕ c ⊕ d ⊕ Cin        (2.6)
Cout = (a ⊕ b)·c + (a ⊕ b)′·a        (2.7)
Carry = (a ⊕ b ⊕ c ⊕ d)·Cin + (a ⊕ b ⊕ c ⊕ d)′·d        (2.8)

Higher order compressors, which compress a larger number of inputs into a smaller number of outputs, can also be used; for example, 5:2 and other higher compressors can be used where large numbers of bits are to be multiplied. But since I am designing a 4 x 4 bit multiplier, the 3:2 compressor serves the purpose.
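
Equations (2.4)-(2.8) can be checked exhaustively. The short Python sketch below is a behavioural model (not the circuit itself) that builds the 4:2 compressor from two 3:2 compressors as in Fig 2.10.2 and confirms that the outputs always preserve the arithmetic value of the inputs.

def compressor_3_2(a, b, c):
    """3:2 compressor / full adder, eqs (2.4)-(2.5)."""
    s = a ^ b ^ c
    carry = ((a ^ b) & c) | ((1 - (a ^ b)) & a)
    return s, carry

def compressor_4_2(a, b, c, d, cin):
    """4:2 compressor built from two 3:2 compressors (Fig 2.10.2), eqs (2.6)-(2.8)."""
    s1, cout = compressor_3_2(a, b, c)        # first stage
    sum_, carry = compressor_3_2(s1, d, cin)  # second stage
    return sum_, carry, cout

for v in range(32):
    a, b, c, d, cin = [(v >> k) & 1 for k in range(5)]
    s, carry, cout = compressor_4_2(a, b, c, d, cin)
    # value preserved: a + b + c + d + cin = sum + 2*(carry + cout)
    assert a + b + c + d + cin == s + 2 * (carry + cout)
print("4:2 compressor preserves the column sum for all 32 input combinations")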

2.6 Carry lookahead adder
The carry lookahead adder can produce carries faster because the carry bits are generated in parallel by additional circuitry whenever the inputs change. This technique uses carry bypass logic to speed up the carry propagation [11]. Let a(i) and b(i) be the augend and addend inputs, C(i) the carry input, and S(i) and C(i+1) the sum and carry-out of the ith bit position. With the auxiliary functions p(i) and g(i), called the propagate and generate signals, the carry and sum are given as follows:

p(i) = a(i) ⊕ b(i)
g(i) = a(i)·b(i)
C(i) = g(i) + p(i)·C(i-1)
S(i) = p(i) ⊕ C(i-1)

As the number of bits in the carry lookahead adder increases, the complexity increases, because the number of gates in the expression for C(i+1) increases. So practically it is not desirable to use the traditional CLA shown above, because it increases the space required and the power too. Instead, smaller carry lookahead adders are used in levels to create a larger CLA; commonly the smaller CLA is taken as a 4-bit CLA, so that carry lookahead is defined over groups of 4 bits. The carry lookahead logic which produces the individual group carries is illustrated in figure 2.11. The carries are produced in two stages. Since the group G and P signals coming from the groups are positive logic, the first stage is set up in a product of sums manner (i.e. the first stage is OR-AND-INVERT logic). The first stage of the carry lookahead logic produces the supergroup G and P. The second stage uses the supergroup G and P produced in the first stage, along with the carry-in, to make the final group carries, which are then distributed to the individual group output stages. This is made clearer by the figure below and the equations written above.


[Fig 2.11 4-bit carry lookahead adder: operand bits a1-a4 and b1-b4 feed the PG logic, which drives the carry logic and the sum logic to produce the sums sm1-sm4]

A carry lookahead adder improves speed by reducing the amount of time required to determine the carry bits. It can be contrasted with the simpler, but usually slower, ripple carry adder, in which the carry bit is calculated alongside the sum bit, so that each bit must wait until the previous carry has been calculated before it can begin calculating its own sum and carry bits. A ripple carry adder works in the same way as pencil-and-paper addition. Although simple in concept, it has a long circuit delay due to the many gates in the carry path from the least significant bit to the most significant bit: the carry path in the 4-bit ripple carry adder has a total of eight gates in cascade, so the circuit has a delay of eight gate delays. Since only AND and OR gates are involved in the carry path, ideally the delay for each of the four carry signals produced, C1 through C4, would be just two gate delays. The basic carry lookahead circuit is simply a circuit in which the functions C1 through C3 have a delay of only two gate delays; the implementation of C4 is more complicated in order to allow the 4-bit carry lookahead adder to be extended to multiples of 4 bits, such as 16 bits. The carry lookahead adder is faster than the ripple carry adder because some computations are done in advance, as is clear from the equations below.

C1 = g1 + p1·C0        (2.9)
C2 = g2 + p2·(g1 + p1·C0)        (2.10)
C3 = g3 + p3·(g2 + p2·(g1 + p1·C0))        (2.11)
C4 = g4 + p4·(g3 + p3·(g2 + p2·(g1 + p1·C0)))        (2.12)

As is clear from equation (2.12), part of C4 can be pre-computed as soon as C0, p1 and g1 are known [10]. The area-delay product gives a clear picture of the space-time tradeoff among the adders discussed above. While ripple carry adders have a smaller area and lower speed, carry select adders have high speed (nearly twice the speed of ripple carry adders) but occupy a larger area. The carry lookahead adder (CLA) has a proper balance between the area occupied and the time required; hence, among the three, the carry lookahead adder has the least area-delay product, and it should be used when optimizing for both area and time. For instance, the last stage of the Wallace tree adder in a Booth multiplier is a carry lookahead adder. Regarding circuit area complexity, the ripple-carry adder (RCA) is the most efficient one, and the carry lookahead adder (CLA) is more complex than the ripple carry adder.
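
Equations (2.9)-(2.12) can be transcribed directly into a behavioural sketch (illustrative Python, invented names), assuming C0 is the external carry-in; each carry depends only on p, g and C0, which is what removes the ripple.

def cla_4bit(a, b, c0=0):
    """4-bit carry lookahead adder: carries expanded as in eqs (2.9)-(2.12)."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(4)]   # propagate
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(4)]   # generate
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1]) | (p[3] & p[2] & p[1] & g[0])
          | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]                                 # carry into each bit position
    s = [p[i] ^ carries[i] for i in range(4)]                  # S(i) = p(i) XOR C(i-1)
    return sum(bit << i for i, bit in enumerate(s)) | (c4 << 4)

for a in range(16):
    for b in range(16):
        assert cla_4bit(a, b) == a + b
print("4-bit CLA matches a + b for all operand pairs")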


2.7 Kogge stone adder
This adder, proposed by Kogge and Stone in 1973, has minimal depth as well as bounded fan-out (the maximum fan-out is 2), at the cost of a massively increased number of black nodes and interconnections [3]. This is achieved by using a large number of independent tree structures in parallel. The Kogge stone adder is essentially a prefix-based carry lookahead adder. In a prefix circuit, every output depends on all inputs of equal or lower magnitude, and every input influences all outputs of equal or higher magnitude. Let ∘ be an arbitrary associative binary operation. A prefix circuit for ∘ is a combinational circuit which takes n inputs x1, x2, . . . , xn and generates the n outputs x1, x1 ∘ x2, x1 ∘ x2 ∘ x3, . . . , x1 ∘ . . . ∘ xn, as shown in figure 2.12.

[Fig 2.12 Function of a parallel prefix circuit: inputs x1 . . . xn; outputs x1, x1 ∘ x2, . . . , x1 ∘ . . . ∘ xn]

The Kogge stone adder generates the carries in O(log n) time; it is widely considered the fastest adder and is widely used in industry for high performance arithmetic circuits. In the KSA, carries are computed quickly by computing them in parallel, at the cost of increased area, and wiring congestion is often a problem for Kogge-Stone adders. The Kogge stone adder is also called a tree adder. The complete functioning of the KSA can be comprehended by analyzing it in terms of three distinct parts, as shown below.

Preprocessing:- This step involves the computation of the generate and propagate signals corresponding to each pair of bits in A and B. These signals are given by the logic equations below:
pi = Ai xor Bi
gi = Ai and Bi

Carry lookahead network:- This block differentiates the KSA from other adders and is the main force behind its high performance. This step involves the computation of the carries corresponding to each bit. It uses group propagate and generate signals as intermediate values, which are given by the logic equations below:
Pi:j = Pi:k+1 and Pk:j        (2.13)
Gi:j = Gi:k+1 or (Pi:k+1 and Gk:j)        (2.14)

Post processing:- This is the final step and is common to all adders of this family (carry lookahead). It involves the computation of the sum bits, which are given by the logic below:
Si = pi xor Ci-1

[Fig 2.12 4-bit Kogge Stone adder: operand bits a0-a3 and b0-b3 produce the bit-level signals G0 P0 . . . G3 P3, which are combined into the group signals G1:0 P1:0, G2:1 P2:1, G3:2 P3:2 and G3:0 P3:0, whose generate terms form the carries C0, C1, C2]


The 4-bit Kogge-Stone adder shown above works as follows. Each vertical stage produces a "propagate" and a "generate" bit, as shown [6]. The culminating generate bits (the carries) are produced in the last stage (vertically), and these bits are XORed with the initial propagate signals after the input to produce the sum bits. For example, the first (least significant) sum bit is calculated by XORing the propagate in the farthest-right position (a "1") with the carry-in (a "0"), producing a "1"; the second bit is calculated by XORing the propagate in the second position from the right (a "0") with C0 (a "0"), producing a "0". The group PG equations written above are for valency-2 group PG logic, because they combine pairs of smaller groups. When larger numbers of groups are combined, as in valency-4 group logic, the equations [11] for the PG logic become:

Pi:j = Pi:k and Pk-1:l and Pl-1:m and Pm-1:j        (2.15)
Gi:j = Gi:k or Pi:k·(Gk-1:l or Pk-1:l·(Gl-1:m or Pl-1:m·Gm-1:j)),   (i >= k > l > m > j)        (2.16)

The Kogge stone adder is the fastest of the CLA based tree adders. Kogge stone adders in higher radices are also available; the logic depth of a higher radix (radix-4) KSA is smaller, but each stage is more complex than in the radix-2 Kogge stone adder. The 4-bit radix-2 adder is shown above, and an 8-bit radix-4 Kogge stone adder is shown below.

[Fig 2.13 8-bit radix-4 Kogge stone adder: the input operands feed the carry logic, which produces the sum bits S7 S6 S5 S4 S3 S2 S1 S0]
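
The three KSA steps (preprocess, prefix carry network, post-process) map naturally onto code. The sketch below is a behavioural radix-2 Kogge-Stone model in Python (written for clarity, not the thesis schematic) using the valency-2 group equations (2.13)-(2.14).

def kogge_stone_add(a, b, n=8, cin=0):
    """Radix-2 Kogge-Stone adder: group (G, P) computed with eqs (2.13)-(2.14)."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(n)]   # preprocessing
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(n)]
    G, P = g[:], p[:]                     # G[i], P[i] start as the bit-level signals
    dist = 1
    while dist < n:                       # log2(n) prefix stages
        Gn, Pn = G[:], P[:]
        for i in range(dist, n):
            Gn[i] = G[i] | (P[i] & G[i - dist])   # Gi:j = Gi:k+1 or (Pi:k+1 and Gk:j)
            Pn[i] = P[i] & P[i - dist]            # Pi:j = Pi:k+1 and Pk:j
        G, P = Gn, Pn
        dist *= 2
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n - 1)]   # carry into each bit
    s = [p[i] ^ carries[i] for i in range(n)]                       # post-processing
    return sum(bit << i for i, bit in enumerate(s))

for a in range(256):
    for b in range(256):
        assert kogge_stone_add(a, b) == ((a + b) & 0xFF)   # sum modulo 2**8
print("8-bit Kogge-Stone adder verified against ordinary addition")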


CHAPTER 3DYNAMIC CMOS CIRCUITS3.1 Introduction Although there are many positive reasons for using static CMOS logic, there are also numerous drawbacks. Static devices inherently have more components and clocked transistors than dynamic devices. A full latch for example in the traditional static configuration may require 66 transistors. A dynamic configuration performing the same function may require only 36 transistors. The number of transistors used to construct a flip-flop is also significantly reduced by using dynamic logic as opposed to fully static logic. Reducing the total number of transistors not only allows the overall device to be significantly smaller, but also reduces the power requirements of the system. Most of the disadvantages of using static CMOS, however, are associated with the use of PMOS because hole mobilities are significantly slower than electron mobilities, PMOS devices must be much larger than NMOS devices for the two to have the same ability to transport a fixed amount of charge during a fixed time interval. The larger surface area needed to form a PMOS device than an NMOS device is not only a detriment to the overall chip size, but also increases the capacitance associated to the PMOS device. The larger capacitance and slower carrier mobilities associated with PMOS cause results in a greater time delay for the PMOS to charge up the capacitor associated with the next logic stage. This increased time delay becomes a bottleneck when trying to design faster circuits. In standard CMOS logic, one PMOS device will always compliment an NMOS device. Altering this logic so that fewer PMOS devices are needed will vastly improve circuit performance. An alternative logic that reduces the number of PMOS devices while also solving most of the problems associated with pseudoNMOS logic is dynamic CMOS. The basic structure of dynamic CMOS logic is shown in Fig 3.1 [11]. When the clock is low, the NMOS device is cutoff while the PMOS is turned ON. This has the effect of disconnecting the output node from ground while simultaneously connecting the node to VDD. Since the input to the next stage is charged through the PMOS transistor when the clock is low, this phase of the clock is known as the precharge phase. When clock is high however, the PMOS is cutoff and the bottom NMOS is turned ON, thereby disconnecting the output node from VDD and providing a possible pull-down path to ground through the bottom22

NMOS transistor. This part of the clock cycle is known as the evaluation phase, and so the bottom NMOS is called the evaluation NMOS. When the clock is in the evaluation phase, the output node will either be maintained at its previous logic level or discharged to GND.

[Fig 3.1 Basic dynamic CMOS circuit: a clocked PMOS precharge transistor between Vdd and the output Y, the NMOS logic block with input A, and a clocked NMOS evaluation transistor to Gnd]

In other words, the output node may be selectively discharged through the NMOS logic structure, depending upon whether or not a path to GND is formed by the inputs of the NMOS logic block. If a path to ground is not formed during the evaluation phase, the output node will maintain its previous voltage level, since no path exists from the output to VDD or GND through which the charge could flow away.

3.2 Footed dynamic circuit
If the input A is high during precharge, contention will take place because both the PMOS and NMOS transistors will be ON. When the input cannot be guaranteed to be zero during precharge, an extra clocked evaluation transistor can be added to the bottom of the NMOS stack to avoid contention. The extra transistor is called the foot. Removing contention improves the robustness of the gate, although footed gates have a higher logical effort than unfooted ones because of the extra series transistor.

[Fig 3.2 Footed dynamic circuit: the clocked foot transistor is placed at the bottom of the NMOS stack, below the input transistor A]

3.3 Advantages of dynamic circuits
No static power dissipation: Dynamic circuits do not dissipate power when there is no circuit activity, i.e. when no change in the inputs occurs. A dynamic circuit dissipates power when the inputs are active, i.e. when an input switches from one state to another.
Higher speed: Dynamic logic is faster than static logic. It uses mostly fast N transistors, that is, it uses more NMOS than PMOS devices, whereas static circuits use more PMOS devices to implement a logic function. A PMOS device is slower than an NMOS device because the mobility of holes is lower than the mobility of electrons. So dynamic circuits are faster than static circuits. An example is shown in figures 3.3 and 3.4.


[Fig 3.3 Static NAND and Fig 3.4 Dynamic NAND: the static gate uses two PMOS pull-up and two NMOS pull-down transistors for inputs A and B, while the dynamic gate uses a clocked PMOS precharge transistor, the A-B NMOS stack and a clocked foot transistor]

Low power requirement: A static circuit uses more transistors than a dynamic one; for example, a static latch requires 66 transistors whereas a dynamic latch requires only 36 [7]. Reducing the number of transistors not only allows the overall device to be significantly smaller, but also reduces the power requirement of the system.

3.4 Monotonicity requirement in dynamic circuits
A fundamental difficulty with dynamic circuits is the monotonicity requirement [6]. While a dynamic gate is in evaluation, its input must be monotonically rising: the input can start LOW and remain LOW, start LOW and go HIGH, or start HIGH and remain HIGH, but it must not fall during evaluation. As shown in figure 3.6, the dynamic inverter violates monotonicity. During precharge, the output is pulled HIGH. When the clock rises, the input is HIGH, so the output is discharged LOW through the pull-down network. The input later falls LOW, turning off the pull-down network; however, the precharge transistor is also off, so the output floats and stays LOW rather than rising. The output will remain low until the next precharge step.


In summary, the inputs must be monotonically rising for the dynamic gate to compute the correct function.

[Fig 3.5 Dynamic circuit: the clocked PMOS precharge transistor drives the dynamic node Y, with input A in the pull-down path and a clocked foot transistor to Gnd]

[Fig 3.6 Monotonicity in dynamic circuits: waveforms of Clk, the input A falling during evaluation, and the output Y remaining stuck low]

Unfortunately, the output of a dynamic gate begins HIGH and monotonically falls LOW during evaluation. This monotonically falling output is not a suitable input to a second dynamic gate, so dynamic gates sharing the same clock cannot be directly connected, and a solution to this problem is required. The next section discusses the solution to the monotonicity problem in detail; the behaviour itself is illustrated by the short model below.
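
The monotonicity failure can be shown with a tiny behavioural model of a footed dynamic inverter, written in Python for illustration (an ideal gate with no leakage is assumed): the dynamic node is precharged, and during evaluation it can only be discharged, never recharged, so a non-monotonic input leaves the output stuck at the wrong value until the next precharge.

class DynamicInverter:
    """Behavioural dynamic inverter: out = NOT a, valid only for monotonically rising inputs."""
    def precharge(self):
        self.node = 1                    # dynamic node charged high
    def evaluate(self, a):
        if a:                            # pull-down path conducts; node can only discharge
            self.node = 0
        return self.node                 # no path back to Vdd while the clock is high

gate = DynamicInverter()

# Monotonically rising input during evaluation: correct
gate.precharge()
print([gate.evaluate(a) for a in (0, 0, 1, 1)])   # [1, 1, 0, 0] -> NOT(a) at every sample

# Non-monotonic input (1 then falling to 0): NOT(0) = 1 is never recovered
gate.precharge()
print([gate.evaluate(a) for a in (1, 0, 0)])      # [0, 0, 0] -> wrong after the input falls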


3.5 Domino logic dynamic circuits
The monotonicity problem can be solved by placing a static CMOS inverter between dynamic gates, as shown below in figure 3.7. This converts the monotonically falling output into a monotonically rising signal suitable for the next gate. The dynamic-static pair together is called a domino gate [7], because its operation resembles a chain of dominoes: the precharge represents setting up the dominoes, and the evaluation represents their sequential tipping over, each triggering the next. A single clock can be used to precharge and evaluate all the logic gates within the chain. The dynamic output is monotonically falling during evaluation, so the static inverter output is monotonically rising.

[Fig 3.7 Two dynamic NAND gates sharing the same clock Clk: the first gate (inputs A, B, dynamic node W) drives a static inverter whose output X feeds the second gate (dynamic node R, inverter output Y)]

No doubt the domino circuit removes the monotonicity problem, but it has certain disadvantages of its own: a non-inverting output and the charge sharing problem.


As is clear, placing the inverter solves the monotonicity problem, so the correct output Y is obtained; this is shown by the waveforms in the figure below.

[Fig 3.8 Output waveforms of two dynamic NAND gates sharing the same clock: Clk, the dynamic nodes W and R, and the inverter outputs X and Y over the precharge and evaluation phases]

3.5.1 Properties of domino logic
A single clock can be used to precharge and evaluate each stage in a chain. Precharge occurs in parallel, but evaluation occurs sequentially. The static inverter can in general be replaced by a static gate. Unlike static CMOS gates, domino gates are inherently non-inverting. The gate is capable of very high speed.


3.5.2 Example: XOR gate using domino logic

[Fig 3.9 Domino logic XOR gate: a clocked PMOS precharge transistor, an NMOS pull-down network implementing A xor B, a clocked foot transistor and an output inverter driving Y]

When the clock is in the precharge phase, the PMOS is on and the clocked pull-down NMOS is off, so the dynamic node is high (Vdd); therefore, whatever the inputs are, the dynamic node remains high during precharge and the final output is low after passing through the inverter. In the evaluation phase the clock goes high, so the pull-up PMOS turns off and the clocked pull-down NMOS turns on, and the output is then evaluated based upon the status of the inputs.

3.5.3 Advantages of domino logic circuits
Faster switching speed:- The study of dynamic circuits shows that they have a faster switching speed than static CMOS. They therefore serve the needs of the CMOS industry, which requires faster logic to perform millions of functions at a time.

No short circuit current:- The domino logic circuit does not have any short circuit current, which means less power dissipation; it is therefore well suited for low power applications.

3.5.4 Disadvantages of domino logic circuits
Non-inverting output: The domino circuit produces only a non-inverting output, whereas certain logic synthesis operations require inverting as well as non-inverting operations in the same circuit. So there is a need for a logic style with both inverting and non-inverting functions.
Charge sharing: Charge sharing [5] is an undesirable signal integrity phenomenon observed most commonly in the domino logic family of digital circuits. The charge sharing problem occurs when the charge stored at the output node during the precharge phase is shared among the output or junction capacitances of the transistors during the evaluation phase. Charge sharing may degrade the output voltage level or even cause an erroneous output value.
Clock overloading: In domino logic circuits a clock is associated with every PMOS precharge transistor; therefore, in large or cascaded domino logic circuits, clock overloading occurs. The more PMOS transistors there are, the more clock loads there are, and hence the more power is dissipated.


3.5.5 Keeper
Dynamic circuits also suffer from charge leakage on the dynamic node. If the dynamic node is precharged high and then left floating, the voltage on the dynamic node will drift over time due to subthreshold, gate and junction leakage; this problem is analogous to leakage in a dynamic RAM. Moreover, dynamic circuits have poor noise margins: if an input rises above Vt while the gate is in the evaluation phase, the input transistor will turn on weakly and incorrectly discharge the output. Both the leakage and the noise margin problems can be reduced by adding a keeper. The keeper is a weak transistor that holds the output at the correct level when it would otherwise float [10]. When the dynamic node X is high, the output Y is low and the keeper is ON, preventing X from floating. When X falls, the keeper initially opposes the pull-down transition, so it must be weaker than the pull-down network; eventually Y rises, turning the keeper OFF and thus avoiding static power dissipation.

[Fig 3.10 Dynamic NAND gate with keeper: a weak PMOS keeper, driven by the output Y, holds the dynamic node X high; inputs A and B form the pull-down stack with the clocked foot transistor]

The keeper must be strong enough to compensate for any leakage drawn when the output is floating and the pull-down stack is OFF. A strong keeper also improves the noise margin, because when the input is slightly above Vt the keeper can supply enough current to hold the dynamic node. The keeper width should be carefully decided: too strong a keeper may create contention with the pull-down network, while too weak a keeper may not be able to hold the output node at its correct value.


Chapter 4
LEAKAGE IN CMOS CIRCUITS

4.1 Leakage in CMOS circuits
Low power consumption in high performance VLSI circuits is highly desirable, as it directly relates to battery life, reliability, packaging and heat removal costs. With the continuous trend of technology scaling, leakage power is becoming a major contributor to the total power consumption in CMOS circuits. Scaling of Vdd reduces dynamic power consumption but degrades the performance of the circuit as well; this can be partially compensated by lowering Vth, but at the cost of increased leakage power. Minimizing leakage power consumption is currently an extremely challenging area of research, especially with the number of on-chip devices doubling every two years. Leakage power dissipation [4] arises from the leakage currents flowing through a transistor when there are no input transitions and the transistor has reached steady state. Unlike dynamic power, leakage power depends on the total number of transistors in the circuit, their types and their operating status, regardless of their switching activity; this makes it more difficult to reduce leakage power than dynamic power. Leakage current consists largely of subthreshold leakage, which is pattern dependent as it occurs only in OFF transistors. Hence there is a need for robust techniques to reduce this leakage power dissipation, and several techniques have been proposed that efficiently minimize it. Leakage power has two main forms in modern IC processes: subthreshold leakage and gate leakage. Subthreshold leakage power is due to a non-zero current between the source and drain terminals of an OFF MOS transistor. With each process generation, supply voltages are reduced, and transistor threshold voltages (Vth) must also be reduced to mitigate the performance degradation; reducing Vth leads to an exponential increase in subthreshold leakage. Gate leakage, on the other hand, is due to tunneling current through the gate oxide of an MOS transistor. In modern IC processes, gate oxides are thinned to improve transistor drive capability, which has led to a considerable increase in gate leakage.


4.1.1 Subthreshold leakage
Subthreshold current is the most dominant among all sources of leakage [8]. It is caused by minority carriers drifting across the channel from drain to source, due to the presence of a weak inversion layer when the transistor is operating in the cutoff region, VGS < Vth. The minority carrier concentration rises exponentially with the gate voltage VG. ISUB depends on the substrate doping concentration and the halo implant, which modify the threshold voltage VTH; ISUB also rises exponentially with temperature. Leakage power dissipation has become a considerable proportion of the total power dissipated in modern deep submicron technologies. The following equations relate the subthreshold current ISUB to the other device parameters:

ISUB = I0 · exp[(Vgs - Vth0 - γ′·Vsb + η·Vds) / (n·Vθ)] · (1 - exp(-Vds / Vθ))        (4.1)
I0 = μ·Cox·(W/L)·Vθ²·exp(1.8)        (4.2)
Vθ = kT/q        (4.3)

W and L are the width and length of the transistor, μ is the carrier mobility, Vθ is the thermal voltage, γ′ is the linearized body-effect coefficient, η is the drain induced barrier lowering (DIBL) coefficient and n is the subthreshold slope factor. The dependence of the subthreshold current on these parameters can be summed up in a table; the occurrence of the leakage current in an NMOS device is shown in figure 4.1.

Table 4.1 Dependence of subthreshold leakage current on device parameters

Parameter              | Dependence
Transistor width (W)   | Directly proportional
Transistor length (L)  | Inversely proportional
Input voltage (Vgs)    | Exponential increase
Temperature (T)        | Exponential increase

The table above provides a clear view of the dependence of the subthreshold current on transistor width, length, input voltage and temperature. It is useful to know these dependences, because by controlling these parameters the subthreshold leakage can be reduced; however, increasing or decreasing these parameters may have adverse side effects. In deep submicron processes, Vdd and the VTH of MOS transistors have been greatly reduced. This effects, to an extent, a reduction in the switching power dissipation, but the exponential behavior of the subthreshold leakage current increases the static power dissipation. Static power consumption is the product of the device leakage current and the supply voltage:

PS = (leakage current) x (supply voltage)

So it is clear from the above that the static power dissipation will increase with an increase in leakage current and supply voltage. There are, however, many remedies to overcome static power dissipation, and these will be discussed in the next section. Portable battery operated devices that have long idle times are particularly affected by this leakage power loss: such a device remains idle for a majority of the time, but since it is not turned off, valuable battery power is drained, which reduces the battery service life. Existing designs must therefore be modified. This work analyzes the proposed techniques with circuit performance metrics such as leakage power, dynamic power and propagation delay forming the basis.

[Fig 4.1 Illustration of subthreshold leakage in an NMOS transistor: with VG < VT, VS = 0, VD = VDD and VB = 0, the current ISUB flows between the n+ drain and source regions]
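
Equations (4.1)-(4.3) can be evaluated numerically to see the trends listed in Table 4.1. The Python sketch below uses placeholder parameter values chosen only for illustration; they are not the 180nm device data used in this thesis.

import math

def i_sub(vgs, vds, vsb=0.0, vth0=0.4, n=1.5, eta=0.08, gamma_p=0.2,
          w=1e-6, l=0.18e-6, mu=350e-4, cox=8e-3, temp=300.0):
    """Subthreshold current per eqs (4.1)-(4.3); all parameter values are illustrative only."""
    vtheta = 1.380649e-23 * temp / 1.602176634e-19            # eq (4.3): kT/q
    i0 = mu * cox * (w / l) * vtheta**2 * math.exp(1.8)       # eq (4.2)
    expo = (vgs - vth0 - gamma_p * vsb + eta * vds) / (n * vtheta)
    return i0 * math.exp(expo) * (1.0 - math.exp(-vds / vtheta))   # eq (4.1)

# Trends from Table 4.1: leakage grows with W, shrinks with L, grows exponentially with Vgs and T
base = i_sub(vgs=0.0, vds=1.8)
print("wider device  :", i_sub(vgs=0.0, vds=1.8, w=2e-6) / base)      # ~2x
print("longer channel:", i_sub(vgs=0.0, vds=1.8, l=0.36e-6) / base)   # ~0.5x
print("Vgs + 100 mV  :", i_sub(vgs=0.1, vds=1.8) / base)              # exponential increase
print("T = 350 K     :", i_sub(vgs=0.0, vds=1.8, temp=350.0) / base)  # exponential increase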

4.1.2 Gate leakage
Gate leakage is a current that flows into the gate of the transistor by tunneling. It is a serious concern at gate oxide thicknesses below 2 nm [8]: with such a thin gate oxide, a fairly small potential difference across the oxide can induce a high electric field, causing electrons to tunnel into or through the oxide. The two main tunneling phenomena that lead to gate leakage currents are Fowler-Nordheim (FN) tunneling and direct tunneling. The tunneling probability of an electron depends on the thickness of the barrier, the barrier height and the structure of the barrier. FN tunneling occurs only at very high fields [7]; direct tunneling is the more common phenomenon in 45 nm bulk-CMOS devices. The reason why gate leakage was neglected until recent years is that tunneling drops exponentially with an increase in gate oxide thickness, so for older processes with tox greater than 2 nm the gate leakage was much smaller than the subthreshold leakage (ISUB) and was therefore neglected. With current process technology parameters, however, gate leakage has already increased to more than double the subthreshold current and will continue to increase at a much higher rate, mandating the use of high-k materials other than silicon dioxide to enable thicker oxides and/or the use of a gate material other than polysilicon, such as a metal gate. As technology advances, tox decreases by about 30% with every technology generation, and for tox smaller than 1.4 nm, IGate rises by about 1000X in the following process technology step, while ISUB rises by about 5X under normal scaling theory. As an example, NEC's 100 nm process technology has tox = 1.6 nm, an ISUB of 0.3 nA/µm of gate width, and an NMOS IGate of 0.65 nA/µm; IGate has already increased to more than double ISUB in some cases. Tunneling current in PMOS devices is an order of magnitude smaller than in NMOS devices, because the holes in PMOS devices have to pass a higher barrier to tunnel (holes tunnel from the valence band). Gate leakage complicates CMOS circuit operation and analysis, as gates no longer have infinite input impedance as was previously assumed. Figure 4.1 shows the possible transistor states that will cause gate leakage current to flow in an NMOS device. The input vectors have an impact on the gate leakage which is different from their impact on the subthreshold leakage: with 10 as the input vector, |Vs| = Vth_b (the threshold voltage considering the body effect), whereas with 00 as the input |Vg1| = Vm (