ECE 307 DIGITAL INTEGRATED CIRCUITS … 14-15-ece... · mirror...

ECE 307 Digital Integrated Circuits
Spring Semester 2018

Lectures 14-15: Arithmetic and Logic Circuits

Haris Theocharides ([email protected]) (ack: Prof. Mary Jane Irwin and Vijay Narayanan)

[Adapted from “Rabaey’s Digital Integrated Circuits, ©2002, J. Rabaey et al.”]

ECE307 L14-15 Arithmetic and Logic Circuits © Theocharides, ECE, 2018

Review: Basic Building Blocks

● Datapath
  - Execution units
      - Adder, multiplier, divider, shifter, etc. (today!)
  - Register file and pipeline registers (memory, see below)
  - Multiplexers, decoders, etc. (this lecture and L.15)
● Control
  - Finite state machines (PLA, ROM: Lecture 17)
● Interconnect
  - Switches, arbiters, buses (Lecture 16)
● Memory
  - Caches (SRAMs), TLBs, DRAMs, buffers
  - Lecture 17


The 1-bit Binary Adder

[Figure: 1-bit Full Adder (FA) block with inputs A, B, Cin and outputs S, Cout]

S = A ⊕ B ⊕ Cin
Cout = A&B | A&Cin | B&Cin (majority function)

● How can we use it to build a 64-bit adder?
● How can we modify it easily to build an adder/subtractor?
● How can we make it better (faster, lower power, smaller)?

A  B  Cin | Cout  S | carry status
0  0  0   | 0     0 | kill
0  0  1   | 0     1 | kill
0  1  0   | 0     1 | propagate
0  1  1   | 1     0 | propagate
1  0  0   | 0     1 | propagate
1  0  1   | 1     0 | propagate
1  1  0   | 1     0 | generate
1  1  1   | 1     1 | generate

G = A&B
P = A ⊕ B
K = !A & !B

S = P ⊕ Cin
Cout = G | P&Cin
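The two formulations of the full adder can be compared exhaustively. The following is a minimal sketch; the function names `full_adder`, `full_adder_gpk`, and `carry_status` are illustrative and not part of the slides.

```python
from itertools import product

def full_adder(a, b, cin):
    """Return (cout, s) for one-bit inputs, using the sum and majority equations."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)   # majority function
    return cout, s

def full_adder_gpk(a, b, cin):
    """Same adder expressed through the generate (G) and propagate (P) signals."""
    g = a & b
    p = a ^ b
    return g | (p & cin), p ^ cin

def carry_status(a, b):
    """Classify an input pair as kill, propagate, or generate."""
    if a & b:
        return "generate"
    if a ^ b:
        return "propagate"
    return "kill"

# Print the truth table and check that both formulations agree on every row.
for a, b, cin in product((0, 1), repeat=3):
    assert full_adder(a, b, cin) == full_adder_gpk(a, b, cin)
    cout, s = full_adder(a, b, cin)
    print(a, b, cin, cout, s, carry_status(a, b))
```

The carry status depends only on A and B, which is why G, P, and K can be formed before Cin arrives.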


FA Gate Level Implementations

[Figure: two gate-level FA implementations with inputs A, B, Cin and outputs S, Cout; gate delays annotated t0, t1, t2]

● The way you learned to design in ECE 210 and ECE 211


Review: XOR FA

[Figure: XOR-based FA with inputs A, B, Cin and outputs S, Cout]

16 transistors


Review: CPL FA

[Figure: CPL FA with dual-rail inputs A/!A, B/!B, Cin/!Cin and dual-rail outputs S/!S, Cout/!Cout]

20+8 transistors, dual rail; beware of threshold drops


Review: Mirror Adder

[Figure: mirror adder schematic. Complementary NMOS and PMOS networks driven by A, B, and Cin produce !Cout and !S; the kill, generate, 0-propagate, and 1-propagate paths are marked, and transistor sizes are annotated.]

24+4 transistors

Cout = A&B | B&Cin | A&Cin
SUM = A&B&Cin | !Cout&(A | B | Cin)

Sizing: Each input in the carry circuit has a logical effort of 2, so the optimal fan-out for each is also 2. Since !Cout drives 2 internal and 2 inverter transistor gates (to form Cin for the next most significant bit adder), the carry circuit should be oversized. PMOS/NMOS ratio of 2.


Mirror Adder Features

● The NMOS and PMOS chains are completely symmetrical, with a maximum of two series transistors in the carry circuitry, guaranteeing identical rise and fall transitions if the NMOS and PMOS devices are properly sized.

● When laying out the cell, the most critical issue is the minimization of the capacitances at node !Cout (four diffusion capacitances, two internal gate capacitances, and two inverter gate capacitances). Shared diffusions can reduce the stack node capacitances.

● The transistors connected to Cin are placed closest to the output.

● Only the transistors in the carry stage have to be optimized for speed. All transistors in the sum stage can be minimal size.


A 64-bit Adder/Subtractor

[Figure: 64 1-bit FA cells in a chain. Cell i takes Ai and Bi (Bi passed through an XOR gate controlled by add/subt), produces Si, and passes carry Ci+1 to the next cell; C0 = Cin is driven by add/subt and C64 = Cout.]

● Ripple Carry Adder (RCA) built out of 64 FAs

● Subtraction: complement all subtrahend bits (XOR gates) and set the low-order carry-in

● RCA
  - advantage: simple logic, small (low cost)
  - disadvantage: slow (O(N) for N bits) and lots of glitching (so lots of energy consumption)
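The adder/subtractor structure can be sketched behaviorally as follows. This is an illustrative model only; `ripple_add_sub` and its `subtract` flag (standing in for the add/subt control) are assumed names.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (cout, s)."""
    return (a & b) | (a & cin) | (b & cin), a ^ b ^ cin

def ripple_add_sub(a, b, subtract=0, n=64):
    """N-bit ripple-carry adder/subtractor; returns (result, carry_out)."""
    carry = subtract                      # low-order carry-in is 1 when subtracting
    result = 0
    for i in range(n):
        a_i = (a >> i) & 1
        b_i = ((b >> i) & 1) ^ subtract   # XOR gate complements B when subtracting
        carry, s_i = full_adder(a_i, b_i, carry)
        result |= s_i << i
    return result, carry

print(ripple_add_sub(100, 58))               # addition
print(ripple_add_sub(100, 58, subtract=1))   # subtraction via two's complement
```

The loop makes the O(N) carry dependency explicit: bit i cannot finish until the carry from bit i-1 is known.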


Ripple Carry Adder (RCA)

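[Figure: 4-bit ripple-carry adder. Four FA cells take (A0,B0) to (A3,B3), produce S0 to S3, and pass the carry from C0 = Cin to Cout = C4.]

T = O(N) worst-case delay

$T_{adder} \approx T_{FA}(A,B \to C_{out}) + (N-2)\,T_{FA}(C_{in} \to C_{out}) + T_{FA}(C_{in} \to S)$

Real goal: make the carry path as fast as possible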


Inversion Property

[Figure: an FA with all inputs and outputs inverted is equivalent (≡) to a plain FA]

!S (A, B, Cin) = S(!A, !B, !Cin)
!Cout (A, B, Cin) = Cout (!A, !B, !Cin)

● Inverting all inputs to a FA results in inverted values for all outputs
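The inversion property can be confirmed by enumerating all eight input combinations. A minimal sketch, with `inversion_property_holds` as an illustrative name:

```python
from itertools import product

def full_adder(a, b, cin):
    """One-bit full adder: returns (cout, s)."""
    return (a & b) | (a & cin) | (b & cin), a ^ b ^ cin

def inversion_property_holds():
    """Check !Cout(A,B,Cin) == Cout(!A,!B,!Cin) and the same for S."""
    for a, b, cin in product((0, 1), repeat=3):
        cout, s = full_adder(a, b, cin)
        cout_inv, s_inv = full_adder(1 - a, 1 - b, 1 - cin)
        if (1 - cout, 1 - s) != (cout_inv, s_inv):
            return False
    return True

print(inversion_property_holds())
```

The property holds because both the majority function and the three-input XOR are self-dual.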


Exploiting the Inversion Property

[Figure: 4-bit ripple-carry chain of FA’ cells taking (A0,B0) to (A3,B3) and producing S0 to S3, from C0 = Cin to Cout = C4, with regular and inverted cells alternating]

Now need two “flavors” of FAs: a regular cell and an inverted cell

● Minimizes the critical path (the carry chain) by eliminating inverters between the FAs (will need to increase the transistor sizing on the carry chain portion of the mirror adder).


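Manchester Carry Chain

● Switches controlled by Gi and Pi

● Total delay of
  - time to form the switch control signals Gi and Pi
  - setup time for the switches
  - signal propagation delay through N switches in the worst case

[Figure: one Manchester carry chain stage between !Ci and !Ci+1, with switches controlled by Gi and Pi and a clk-controlled precharge]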


4-bit Sliced MCC Adder

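[Figure: 4-bit sliced MCC adder. Each slice forms G and P from Ai and Bi (AND and XOR gates); the clk-precharged chain carries !C0 through !C1, !C2, !C3 to !C4; XOR gates form S0 to S3.]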


Domino Manchester Carry Chain Circuit

[Figure: domino Manchester carry chain, precharged by clk, with inputs Ci,0, G0 to G3, P0 to P3 and output Ci,4; transistor sizes annotated]

Chain node values:

!(G0 | P0 Ci,0)
!(G1 | P1G0 | P1P0 Ci,0)
!(G2 | P2G1 | P2P1G0 | P2P1P0 Ci,0)
!(G3 | P3G2 | P3P2G1 | P3P2P1G0 | P3P2P1P0 Ci,0)
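The expanded carry expressions are the unrolled form of the per-stage recurrence Ci+1 = Gi | Pi&Ci. The sketch below checks the last one exhaustively; `ripple_carries` and `expanded_c4` are illustrative names, and the model ignores precharge and inversion (it works with true carries rather than the inverted chain nodes).

```python
from itertools import product

def ripple_carries(g, p, c0):
    """Apply C[i+1] = G[i] | P[i] & C[i] stage by stage; returns [C0, ..., C4]."""
    carries = [c0]
    for gi, pi in zip(g, p):
        carries.append(gi | (pi & carries[-1]))
    return carries

def expanded_c4(g, p, c0):
    """Fully expanded carry out of a 4-bit chain."""
    g0, g1, g2, g3 = g
    p0, p1, p2, p3 = p
    return (g3 | (p3 & g2) | (p3 & p2 & g1)
            | (p3 & p2 & p1 & g0) | (p3 & p2 & p1 & p0 & c0))

# Compare the recurrence with the expansion for every G, P, C0 combination.
mismatches = 0
for bits in product((0, 1), repeat=9):
    g, p, c0 = bits[0:4], bits[4:8], bits[8]
    if ripple_carries(g, p, c0)[4] != expanded_c4(g, p, c0):
        mismatches += 1
print("mismatches:", mismatches)
```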


Binary Adder Landscape

synchronous word parallel adders
  ripple carry adders (RCA): T = O(N), A = O(N)
  carry prop min adders
    signed-digit adders: T = O(1), A = O(N)
    fast carry prop adders
      Manchester carry chain: T = O(N), A = O(N)
      carry select
      parallel prefix: T = O(log N), A = O(N log N)
      conditional sum
      carry skip: T = O(√N), A = O(N)
    residue adders


Carry-Skip (Carry-Bypass) Adder

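If (P0 & P1 & P2 & P3 = 1) then Co,3 = Ci,0; otherwise the block itself kills or generates the carry internally.

[Figure: 4-bit block of FA cells with inputs (A0,B0) to (A3,B3) and Ci,0, sums S0 to S3, and a bypass multiplexer that selects between the rippled Co,3 and Ci,0 under control of BP]

BP = P0 P1 P2 P3 (“Block Propagate”)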
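A behavioral sketch of one bypass block is shown below. The name `carry_skip_block` and the LSB-first bit lists are assumptions made for illustration; the bypass multiplexer is modeled as a conditional expression.

```python
from itertools import product

def carry_skip_block(a_bits, b_bits, cin):
    """4-bit carry-skip block; bit lists are LSB first.

    Returns (sum_bits, cout, bp), where bp is the block propagate signal.
    """
    p = [a ^ b for a, b in zip(a_bits, b_bits)]
    g = [a & b for a, b in zip(a_bits, b_bits)]
    carry = cin
    sums = []
    for gi, pi in zip(g, p):
        sums.append(pi ^ carry)
        carry = gi | (pi & carry)        # ripple inside the block
    bp = int(all(p))                     # BP = P0 & P1 & P2 & P3
    cout = cin if bp else carry          # bypass mux: skip the ripple path when BP = 1
    return sums, cout, bp

print(carry_skip_block([1, 0, 1, 0], [0, 1, 0, 1], 1))
```

The bypass never changes the logical result, since the rippled carry equals Ci,0 whenever BP = 1; it only shortens the path the carry has to travel.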


Carry-Skip Chain Implementation

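[Figure: carry-skip chain implementation. A Manchester-style chain with inputs Cin, G0 to G3, P0 to P3 produces !Cout; a BP-controlled bypass selects between the block carry-in and the block carry-out to form the carry-out.]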


4-bit Block Carry-Skip Adder

Worst-case delay: a carry from bit 0 to bit 15. The carry is generated in bit 0, ripples through bits 1, 2, and 3, skips the middle two groups (B is the group size in bits), and ripples in the last group from bit 12 to bit 15.

[Figure: 16-bit carry-skip adder in four 4-bit blocks (bits 0 to 3, 4 to 7, 8 to 11, 12 to 15), each with Setup, Carry Propagation, and Sum stages; Ci,0 enters the first block.]

$T_{add} = t_{setup} + B\,t_{carry} + ((N/B) - 1)\,t_{skip} + B\,t_{carry} + t_{sum}$


Optimal Block Size and Time

● Assuming one stage of ripple ($t_{carry}$) has the same delay as one skip logic stage ($t_{skip}$) and both are 1:

  $T_{CSkA} = 1 + B + (N/B - 1) + B + 1 = 2B + N/B + 1$

  (the five terms are $t_{setup}$, ripple in block 0, skips, ripple in the last block, and $t_{sum}$)

● So the optimal block size, B, is

  $dT_{CSkA}/dB = 0 \Rightarrow B_{opt} = \sqrt{N/2}$

● And the optimal time is

  Optimal $T_{CSkA} = 2\sqrt{2N} + 1$
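The closed-form optimum can be compared against a brute-force search over integer block sizes. This sketch uses the unit-delay model stated above and lets N/B be fractional; `t_cska` and `optimal_block` are illustrative names.

```python
import math

def t_cska(n, b):
    """Carry-skip delay in unit delays: T = 2B + N/B + 1."""
    return 2 * b + n / b + 1

def optimal_block(n):
    """Closed-form optimum: B_opt = sqrt(N/2), T_opt = 2*sqrt(2N) + 1."""
    return math.sqrt(n / 2), 2 * math.sqrt(2 * n) + 1

for n in (16, 32, 64):
    b_opt, t_opt = optimal_block(n)
    best_int = min(range(1, n + 1), key=lambda b: t_cska(n, b))
    print(f"N={n}: B_opt={b_opt:.2f}, T_opt={t_opt:.2f}, "
          f"best integer B={best_int}, T={t_cska(n, best_int):.2f}")
```

For widths where sqrt(N/2) is not an integer, the best practical block size is a neighboring integer and the delay is slightly above the closed-form minimum.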


Carry-Skip Adder Extensions

● Variable block sizes
  - A carry that is generated in, or absorbed by, one of the inner blocks travels a shorter distance through the skip blocks, so the inner blocks can be bigger without increasing the overall delay

[Figure: variable block size carry-skip adder from Cin to Cout]

● Multiple levels of skip logic

[Figure: two-level carry-skip adder from Cin to Cout, with skip level 1 and skip level 2; the level 2 skip signal is the AND of the first level skip signals (BPs)]


Carry-Skip Adder Comparisons

[Figure: delay (vertical axis, 0 to 70) versus adder width (8, 16, 32, 48, 64 bits) for RCA, CSkA, and VSkA; the points are labeled with block sizes B=2 through B=6]


Parallel Prefix Adders (PPAs)

● Define carry operator € on (G,P) signal pairs

  (G’’,P’’) € (G’,P’) = (G,P)

  where G = G’’ | P’’G’
        P = P’’P’

  - € is associative, i.e.,
    [(g’’’,p’’’) € (g’’,p’’)] € (g’,p’) = (g’’’,p’’’) € [(g’’,p’’) € (g’,p’)]

[Figure: € cell with inputs (G’’,P’’) and (G’,P’) and output (G,P), and its circuit with inputs G’, G’’, P’’ and output !G]

PPA General Structure

● Given P and G terms for each bit position, computing all the carries is equal to finding all the prefixes in parallel:

  (G0,P0) € (G1,P1) € (G2,P2) € … € (GN-2,PN-2) € (GN-1,PN-1)

● Since € is associative, we can group them in any order
  - but note that it is not commutative

● Measures to consider
  - number of € cells
  - tree cell depth (time)
  - tree cell area
  - cell fan-in and fan-out
  - max wiring length
  - wiring congestion
  - delay path variation (glitching)

[Figure: three-stage structure: Pi, Gi logic (1 unit delay); Ci parallel prefix logic tree (1 unit delay per level); Si logic (1 unit delay)]


Brent-Kung PPA

[Figure: 16-bit Brent-Kung parallel prefix computation. A tree of € cells takes Cin and (G0,P0) through (G15,P15) and produces C1 through C16.]

Annotations on the figure: $T = \log_2 N$; $T = \log_2 N - 2$; $A = 2\log_2 N$; $A = N/2$


Kogge-Stone PPF Adder

[Figure: 16-bit Kogge-Stone parallel prefix computation. A tree of € cells takes Cin and (G0,P0) through (G15,P15) and produces C1 through C16.]

Annotations on the figure: $T = \log_2 N$; $A = \log_2 N$; $A = N$

$T_{add} = t_{setup} + (\log_2 N)\,t_{€} + t_{sum}$
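A Kogge-Stone style prefix computation can be sketched as below. This is a behavioral model under stated assumptions: `kogge_stone_add` is an illustrative name, the prefix tree is built over the N bit positions in log2 N levels, and Cin is folded in afterwards as one extra combining step.

```python
def kogge_stone_add(a, b, cin=0, n=16):
    """Add two n-bit integers with a Kogge-Stone style prefix tree.

    Returns (sum, carry_out, levels), where levels counts prefix tree levels.
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # Gi = Ai & Bi
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # Pi = Ai xor Bi
    G, P = g[:], p[:]
    dist, levels = 1, 0
    while dist < n:
        # Each position combines with the group ending dist bits below it.
        G, P = (
            [G[i] | (P[i] & G[i - dist]) if i >= dist else G[i] for i in range(n)],
            [P[i] & P[i - dist] if i >= dist else P[i] for i in range(n)],
        )
        dist *= 2
        levels += 1
    # Fold in Cin: the carry into bit i+1 is G[i:0] | P[i:0] & Cin.
    carries = [cin] + [G[i] | (P[i] & cin) for i in range(n)]
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, carries[n], levels

print(kogge_stone_add(0xFFFF, 0x0001))
```

Every level doubles the span covered by each (G,P) pair, which is where the log2 N term in the delay comes from.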


Normalized Delay - Adder Comparisons

[Figure: normalized delay (0 to 70) versus adder width (8, 16, 32, 48, 64 bits) for RCA, CSkA, VSkA, and KS PPA]


Multiply Operation

● Multiplication as repeated additions

[Figure: N-bit multiplicand times N-bit multiplier; the partial product array, which can be formed in parallel, adds up to the 2N-bit double precision product]


Shift & Add Multiplication

● Right shift and add
  - Partial product array rows are accumulated from top to bottom on an N-bit adder
  - After each addition, right shift (by one bit) the accumulated partial product to align it with the next row to add
  - Time for N bits: $T_{serial\_mult} = O(N \cdot T_{adder}) = O(N^2)$ for a RCA

● Making it faster
  - Use a faster adder
  - Use higher radix (e.g., base 4) multiplication
      - Use multiplier recoding to simplify multiple formation
  - Form partial product array in parallel and add it in parallel

● Making it smaller (i.e., slower)
  - Use an array multiplier
      - Very regular structure with only short wires to nearest neighbor cells. Thus, very simple and efficient layout in VLSI
      - Can be easily and efficiently pipelined
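The right-shift-and-add scheme can be modeled with two registers, as in the sketch below. The register names `hi` and `lo` and the function name are assumptions for illustration; `hi` holds the accumulated partial product (N bits plus the adder's carry) and `lo` holds the remaining multiplier bits, which are replaced by product bits as they shift in.

```python
def shift_add_multiply(multiplicand, multiplier, n):
    """Unsigned n-bit by n-bit multiply using right shift and add."""
    mask = (1 << n) - 1
    hi = 0                       # accumulated partial product (upper half)
    lo = multiplier & mask       # multiplier bits, consumed LSB first
    for _ in range(n):
        if lo & 1:
            hi += multiplicand & mask     # N-bit add; may carry into bit n
        # Right shift the {hi, lo} pair by one bit to align with the next row.
        lo = (lo >> 1) | ((hi & 1) << (n - 1))
        hi >>= 1
    return (hi << n) | lo

print(shift_add_multiply(13, 11, 4))
```

The loop performs N additions in sequence, so the total time is N adder delays, which is the O(N · Tadder) term in the bullet list.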


Tree Multiplier Structure

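[Figure: tree multiplier. The multiplier Q (‘ier) selects multiples (0 or D) of the multiplicand D (‘icand) in the multiple forming circuits; the partial product array reduction tree feeds a fast carry propagate adder (CPA) that produces P (product).]

Delay: mux + reduction tree (log N) + CPA (log N)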


(4,2) Counter

● Built out of two (3,2) counters (just FAs!)
  - all of the inputs (4 external plus one internal) have the same weight (i.e., are in the same bit position)
  - the internal output is carried to the next higher weight position (marked in the figure)

[Figure: two stacked (3,2) counters forming one (4,2) counter]

Note: Two carry outs, one “internal” and one “external”
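The construction from two full adders can be sketched as follows. The signal names `t_in` and `t_out` for the internal carry in and out are assumptions, as is the choice of which three inputs feed the first FA.

```python
from itertools import product

def full_adder(a, b, cin):
    """(3,2) counter: returns (carry, sum)."""
    return (a & b) | (a & cin) | (b & cin), a ^ b ^ cin

def counter_4_2(x1, x2, x3, x4, t_in):
    """(4,2) counter from two FAs; returns (s, carry, t_out).

    s has the weight of the inputs; carry (external) and t_out (internal)
    both have the next higher weight.
    """
    t_out, s1 = full_adder(x1, x2, x3)      # first FA: internal carry out
    carry, s = full_adder(s1, x4, t_in)     # second FA: external carry out
    return s, carry, t_out

# The outputs always encode the number of ones among the five inputs.
for bits in product((0, 1), repeat=5):
    s, carry, t_out = counter_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + t_out)
print("all 32 input combinations check out")
```

In this arrangement `t_out` does not depend on `t_in`, so tiling counters side by side does not create a ripple path along the row.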


Tiling (4,2) Counters

● Reduces columns four high to columns only two high
  - Tiles with neighboring (4,2) counters
  - Internal carry in at same “level” (i.e., bit position weight) as the internal carry out

[Figure: three (4,2) counters side by side, each built from two (3,2) counters, with internal carries passed between neighbors]


4x4 Partial Product Array Reduction

[Figure: 4x4 multiplication. The multiplicand and multiplier form the partial product array, which is reduced to a two-row array (to CPA) and then added to give the double precision product.]

● Fast 4x4 multiplication using (4,2) counters


8x8 Partial Product Array Reduction

[Figure: 8x8 multiplication (‘icand times ‘ier): partial product array and reduced partial product array]

How many (4,2) counters, at minimum, are needed to reduce it to 2 rows?

Answer: 24


Alternate 8x8 Partial Product Array Reduction

[Figure: 8x8 multiplication (‘icand times ‘ier): partial product array and reduced partial product array, alternate arrangement]

More (4,2) counters, so what is the advantage?


Array Reduction Layout Approach

[Figure: array reduction layout. The multiplicand feeds the multiple generators, controlled by the multiple selection signals (‘ier); below them are stacked (4,2) counter slices, with the CPA at the bottom.]


Parallel Programmable Shifters

[Figure: shifter block with Data In, Data Out, and Control inputs]

Control = shift amount, shift direction, shift type (logical, arithmetic, circular)

● Shifters are used in multipliers and floating point units

● They consume lots of area if done in random logic gates


A Programmable Binary Shifter

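[Figure: pass-transistor shifter stage with inputs Ai, Ai-1, outputs Bi, Bi-1, and control lines rgt, nop, left]

Ai  Ai-1 | rgt  nop  left | Bi  Bi-1
A1  A0   | 0    1    0    | A1  A0
A1  A0   | 1    0    0    | 0   A1
A1  A0   | 0    0    1    | A0  0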


4-bit Barrel Shifter

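[Figure: 4-bit barrel shifter. A 4x4 array of pass transistors connects inputs A0 to A3 to outputs B0 to B3 under control of Sh0, Sh1, Sh2, Sh3.]

Example:
Sh0 = 1: B3B2B1B0 = A3A2A1A0
Sh1 = 1: B3B2B1B0 = A3A3A2A1
Sh2 = 1: B3B2B1B0 = A3A3A3A2
Sh3 = 1: B3B2B1B0 = A3A3A3A3

Area dominated by wiring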
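The example rows describe a sign-extending right shift selected by a one-hot control. A behavioral sketch, where `barrel_shift` is an illustrative name and bit lists are given LSB first as [A0, A1, A2, A3]:

```python
def barrel_shift(a_bits, sh):
    """4-bit barrel shifter with one-hot control sh = [Sh0, Sh1, Sh2, Sh3].

    Returns [B0, B1, B2, B3]; vacated positions are filled with the top bit.
    """
    if sum(sh) != 1:
        raise ValueError("exactly one Sh line may be active at a time")
    k = sh.index(1)                  # shift distance selected by the active line
    n = len(a_bits)
    return [a_bits[min(i + k, n - 1)] for i in range(n)]

a = ["A0", "A1", "A2", "A3"]
for k in range(4):
    sh = [int(i == k) for i in range(4)]
    b = barrel_shift(a, sh)
    print(f"Sh{k} = 1: B3B2B1B0 =", "".join(reversed(b)))
```

Each output connects to its selected input through a single pass transistor, so the data path delay does not grow with the shift distance.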


4-bit Barrel Shifter Layout

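[Figure: 4-bit barrel shifter layout with inputs A0 to A3, control lines Sh0 to Sh3, and output buffers; the horizontal extent is marked Width_barrel]

$Width_{barrel} \sim 2\,p_m N$, where N = max shift distance and $p_m$ = metal pitch

Delay ~ 1 fet + N diff caps

Only one Sh# active at a time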


8-bit Logarithmic Shifter

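[Figure: logarithmic shifter with inputs A0 to A3 and outputs B0 to B3; each stage either passes or shifts its input, controlled by the complementary pairs Sh1/!Sh1, Sh2/!Sh2, Sh3/!Sh3]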
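A logarithmic shifter splits the shift into stages of 1, 2, and 4 positions, each enabled by one bit of the shift amount. The sketch below is a behavioral model only; it assumes a logical right shift and a binary-encoded shift amount, and `log_shift_right` is an illustrative name.

```python
def log_shift_right(value, amount, width=8):
    """Logical right shift built from log2(width) pass-or-shift stages."""
    value &= (1 << width) - 1
    stage, stage_shift = 0, 1
    while stage_shift < width:
        if (amount >> stage) & 1:      # this stage's control bit is set
            value >>= stage_shift      # shift by 1, 2, 4, ...
        stage += 1
        stage_shift *= 2
    return value

print(log_shift_right(0b10110000, 5))
```

With K = log2 N stages, a signal passes through K transistors in series regardless of the shift amount.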


8-bit Logarithmic Shifter Layout Slice

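$Width_{log} \sim p_m\,(2K + (1 + 2 + \dots + 2^{K-1})) = p_m\,(2K + 2^K - 1)$, where $K = \log_2 N$

Delay ~ K fets + 2 diff caps

[Figure: layout slice of the logarithmic shifter with inputs A0 to A3, outputs B0 to B3, and stages shifting by 1, 2, and 4]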


Shifter Implementation Comparisons

         | Barrel                  | Logarithmic
N    K   | Width     Speed         | Width              Speed
         | 2 N pm    1 + N diffs   | pm(2K + 2^K - 1)   K + 2 diffs
8    3   | 16 pm     1 + 8         | 13 pm              3 + 2
16   4   | 32 pm     1 + 16        | 23 pm              4 + 2
32   5   | 64 pm     1 + 32        | 41 pm              5 + 2
64   6   | 128 pm    1 + 64        | 75 pm              6 + 2
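The table entries follow directly from the two width and delay expressions. A small sketch that regenerates them, with widths in units of the metal pitch; `shifter_costs` is an illustrative name and N is assumed to be a power of two:

```python
def shifter_costs(n):
    """Width (in metal pitches) and delay terms for both shifter styles."""
    k = n.bit_length() - 1                   # K = log2 N for N a power of two
    return {
        "N": n,
        "K": k,
        "barrel_width": 2 * n,               # 2 N pm
        "barrel_delay": (1, n),              # 1 fet + N diffusion caps
        "log_width": 2 * k + 2 ** k - 1,     # pm (2K + 2^K - 1)
        "log_delay": (k, 2),                 # K fets + 2 diffusion caps
    }

for n in (8, 16, 32, 64):
    print(shifter_costs(n))
```

The barrel shifter's width grows linearly with N at twice the rate of the logarithmic shifter's dominant term, while its series transistor count stays at one.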


Decoders

● Decodes inputs to activate one of many outputs
  - two inverters, four 2-input NAND gates, four inverters plus enable logic
  - how about for a 3-to-8, 4-to-16, etc. decoder?

[Figure: 2x4 decoder block with inputs In0, In1, and Enable]

Out0 = !In1 & !In0
Out1 = !In1 & In0
Out2 = In1 & !In0
Out3 = In1 & In0
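The output equations can be written out directly, and one possible answer to the 4-to-16 question is to let one 2x4 decoder drive the enables of four others. This is a sketch with active-high enables and outputs for readability; the function names are illustrative.

```python
def decoder_2x4(in1, in0, enable=1):
    """2-to-4 decoder; returns [Out0, Out1, Out2, Out3]."""
    n1, n0 = 1 - in1, 1 - in0
    return [
        enable & n1 & n0,     # Out0 = !In1 & !In0
        enable & n1 & in0,    # Out1 = !In1 &  In0
        enable & in1 & n0,    # Out2 =  In1 & !In0
        enable & in1 & in0,   # Out3 =  In1 &  In0
    ]

def decoder_4x16(a3, a2, a1, a0):
    """4-to-16 decoder: the high bits pick which low-bit decoder is enabled."""
    outputs = []
    for enable in decoder_2x4(a3, a2):
        outputs.extend(decoder_2x4(a1, a0, enable))
    return outputs

print(decoder_2x4(1, 0))
print(decoder_4x16(1, 0, 1, 1).index(1))
```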


Dynamic NOR Decoder

[Figure: dynamic NOR decoder. Precharged output lines B0 to B3 with pull-down transistors driven by A0, !A0, A1, !A1; rails Vdd and GND.]


Dynamic NAND Decoder

[Figure: dynamic NAND decoder. Precharged output lines B0 to B3 with series transistors driven by A0, !A0, A1, !A1, connected to GND.]


Building Big Decoders from Small

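[Figure: decoder tree built from a 1x2 decoder on A4 (with enable), 2x4 decoders on A3 A2, and 2x4 decoders on A1 A0; example values 0 0 0 0 1 are annotated on the inputs, with a 1 → 0 → 1 sequence along the selected path]

Active-low enable, active-low output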


Multiplexers

● Selects one of several inputs to gate to the single output
  - two inverters, four 3-input NANDs, one 4-input NAND
  - how about for an 8x1, 16x1, etc. mux?

[Figure: 4x1 multiplexer block with inputs In0 to In3 and select lines S1, S0]

Out = In0 & !S1 & !S0 |
      In1 & !S1 & S0 |
      In2 & S1 & !S0 |
      In3 & S1 & S0


Review: TG 2x1 Multiplexer

[Figure: transmission gate 2x1 multiplexer, schematic and layout, with inputs In1 and In2, select S and !S, output F, and rails VDD and GND]

F = !((In1 & S) | (In2 & !S))


Building Big Muxes from Small

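[Figure: 4x1 multiplexer built from three 2x1 multiplexers. Two first-level muxes select between A0/A1 and A2/A3 under S0; a second-level mux selects between their outputs under S1 to produce Out.]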
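A tree of 2x1 multiplexers computes the same function as the sum-of-products 4x1 equation. The sketch below checks this exhaustively; the function names are illustrative, and the 2x1 cell is modeled without the output inversion of the transmission gate circuit.

```python
from itertools import product

def mux4_sop(inputs, s1, s0):
    """4x1 mux as a sum of products over the select lines."""
    in0, in1, in2, in3 = inputs
    n1, n0 = 1 - s1, 1 - s0
    return (in0 & n1 & n0) | (in1 & n1 & s0) | (in2 & s1 & n0) | (in3 & s1 & s0)

def mux2(a, b, s):
    """2x1 mux: a when s = 0, b when s = 1."""
    return (a & (1 - s)) | (b & s)

def mux4_tree(inputs, s1, s0):
    """4x1 mux from three 2x1 muxes: S0 selects within pairs, S1 between pairs."""
    low = mux2(inputs[0], inputs[1], s0)
    high = mux2(inputs[2], inputs[3], s0)
    return mux2(low, high, s1)

agree = all(
    mux4_sop(bits[:4], bits[4], bits[5]) == mux4_tree(bits[:4], bits[4], bits[5])
    for bits in product((0, 1), repeat=6)
)
print(agree)
```

The same pattern extends to 8x1 and 16x1: each additional select bit adds one more level to the tree.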


Review: Datapath Bit-Sliced Organization

[Figure: bit-sliced datapath. Identical bit-slice elements (Bit 0 to Bit 3) are tiled; the blocks shown are the register file with its decoder, pipeline registers, multiplexers, adder, and shifter, with connections from I$ and to/from D$. Data flow and control flow run in perpendicular directions.]

Tile identical bit-slice elements


Layout of Bit-Sliced Datapaths

[Figure: three layouts of a bit-sliced datapath]

● Without feedthroughs or pitch matching (4.2 µm²)
● With feedthroughs (3.2 µm²)
● With feedthroughs and pitch matching (2.2 µm²)


Alpha 21264 Integer Unit Datapath

[Figure: Alpha 21264 integer unit datapath floorplan. Labeled blocks: multimedia engine, shifter, intercluster bypass, adder, logic box, register file, register file decoder, logic box, adder, intercluster bypass, load bypass, store FIFO, address drivers, tristate bus driver, bus driver. Labeled signals: RC1_0, RC1_1, RC2_0, RC2_1, LSD_0, LSD_1, to D$.]
