ece/cs 565: vlsi cad flow - university of illinois at chicago · pdf file ·...

Post on 24-Mar-2018


ECE 565: VLSI CAD Flow + Brief Intro to HLS, Logic Optimization & Technology Mapping

Shantanu Dutt

ECE Dept., UIC

VLSI CAD Flow: Overview

High-Level Synthesis (HLS): produces a network of interconnected modules: functional units [FUs], registers, muxes, demuxes,

Logic optimization (mainly of controller FSMs and other “random” logic; other module designs such as arithmetic FUs, registers, muxes, etc. are well known).

a) Design Rule Checking b) Layout Vs. Schematic Verification c) Electrical Rule Checking d) Electrical simulation and functional (logic) + metric (delay, power) verification.

a) Tech. Mapping (mapping small subckts to pre-designed cells in library). b) Physical Synthesis (inserting buffers, upsizing gates, deciding on multiple Vdd and Vth). Can be combined w/ PD.

(HLS)

(E.g., Using a Hardware Description Language: Verilog, VHDL, SystemC or Schematic Capture)

Iterate if simulation identifies metric problems (hot spots, timing violation, crosstalk)


System to Silicon Design

[Figure: system-to-silicon design flow. Recoverable stage labels: System Requirements, Algorithm, Hardware Architecture, Synthesis, Physical, Design For Test (Control, Observe), Fabricate and Test, System Integration. The Algorithm stage is illustrated with the DFT sums X[k] = Σ x[n] e^(-j2πkn/N) and x[n] = Σ X[k] e^(+j2πkn/N).]

Advanced Reliable Systems (ARES) Lab., Jin-Fu Li, EE, NCU

[Source: MITRE]

VLSI CAD Flow: Overview

Constrained optimization problem: the optimization metric at almost all stages is one of power, delay, chip area, or total wire length (WL), with constraints (upper bounds) on the others plus on temperature (approx. power density), crosstalk (routing stage), yield/variability (lower bound), etc. These are very complex processes.

Disconnect between higher stages (above physical design) and lower stages (PD and below): at higher stages, metrics that are fully or partly interconnect-based (delay, area, dynamic power) cannot be estimated very accurately ⇒ incorrect decisions percolate to lower stages (e.g., after routing between an adder and a multiplier, the interconnect length is much larger than anticipated due to, say, congestion and mis-routing) ⇒ some specs (esp. delay) may not be met ⇒ iterate back to higher stages to correct the approximations and re-design.

(HLS)

Iterate if simulation identifies metric problems (hot spots, timing violation, crosstalk)

VLSI CAD Flow: Overview

Course Coverage: with particular focus on HLS and Physical Design

VLSI CAD Flow: Overview

Partitioning cuts up the system into multiple subsystems to minimize some metric (generally the number of wires/nets cut: min-cut partitioning), either for a multi-chip implementation or for recursive invocation for placement on a single chip.

Floorplanning places major/large polygon-shaped modules on a chip to minimize chip area, wirelength (WL), delay or dynamic power, with constraints on the others.

Placement places smaller rectangular cells, whose sizes/shapes are well defined, in sub-chip regions for the same goals as floorplanning (FP).

Routing interconnects cells/modules via horizontal/vertical or 45-degree wires for similar goals as above.

ASIC Design Flow (Advanced VLSI Design, CMPE 641)

Standard Cell Place and Route Flow

Adapted from: CMOS VLSI Design, A Circuits and Systems Perspective, 3rd Edition, Neil Weste et al. © 2005 Pearson Addison-Wesley

A complete chip showing floorplanned modules

• Some modules are custom designed and laid out (e.g., RAMs), while others are generated automatically using, say, standard cells

Modules

Routing


Another complete chip showing RAMs (custom design) + standard-cell layout

RAMs

Standard-cell based layout (most of the chip)


What HLS Produces by Examples

Shantanu Dutt

UIC

Hardware Synthesis – Example 1

Programming language statement:

a := b + c;

Execution in a CPU (computer code generation):

Load r5 b
Load r7 c
ADD r2 r5 r7    (r2 <- r5 + r7)
Store r2 a

[Figure: CPU datapath. Recoverable labels: Register File (r2, r5, r7), Read Bus A, Read Bus B, Write Bus (32-bit labels), Mux, ADD ALU, Control; control signals Reg_r/w, Reg_addr, Alu_oper_select, Mux_select.]

Hardware Synthesis:

[Figure: synthesized datapath. Recoverable labels: registers a, b, c with load signals lda, ldb, ldc; ADDER; muxes with select signals Mux1_select, Mux2_select, Mux3_select; Control Unit (FSM) generating the control signals.]
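A minimal sketch of the CPU-side execution in Example 1, assuming a toy register machine: the statement a := b + c runs as the four-instruction sequence listed on the slide. The `run` function, the tuple encoding of instructions and the memory dictionary are hypothetical illustrations, not part of the lecture.

```python
def run(program, memory):
    """Interpret a tiny Load/ADD/Store program (hypothetical instruction encoding)."""
    regs = {}
    for op, *args in program:
        if op == "Load":        # Load rX var : register <- memory[var]
            reg, var = args
            regs[reg] = memory[var]
        elif op == "ADD":       # ADD rd rs rt : rd <- rs + rt
            rd, rs, rt = args
            regs[rd] = regs[rs] + regs[rt]
        elif op == "Store":     # Store rX var : memory[var] <- register
            reg, var = args
            memory[var] = regs[reg]
        else:
            raise ValueError(f"unknown instruction {op}")
    return memory

# The assembly sequence from the slide for a := b + c.
PROGRAM = [
    ("Load", "r5", "b"),
    ("Load", "r7", "c"),
    ("ADD", "r2", "r5", "r7"),
    ("Store", "r2", "a"),
]

result = run(PROGRAM, {"a": 0, "b": 3, "c": 4})
```

In the synthesized version, the same addition is carried out by a dedicated adder between registers b and c, with the load and mux-select signals driven by the control FSM.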

Hardware Synthesis – Example 2

Programming language construct:

if (a <= b) then begin
  Block A of code;
end
else begin
  Block B of code;
end

Execution in a computer (computer code generation):

    Load r2 a
    Load r3 b
    SUB r2 r2 r3
    BZ A
    BNEG A
B:  Block B assmb code
    JMP C
A:  Block A assmb code
C:  Block C

Hardware Synthesis:

[Figure: synthesized hardware. Recoverable labels: registers a and b with load signals lda, ldb (10-bit labels); 2's compl; Mux with Mux1_select; Adder (may be reused for other operations); flags zero, sign, ovfl; Demux with Demux1_select; Hardware for Block A and Hardware for Block B (synthesized recursively); I/Ps to code block; O/Ps of code block; Control Unit (FSM) with control signals.]

Note (dutt): the condition should be XOR(sign, ovfl) and not AND(sign, not(ovfl)), as shown.
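The note above can be checked with a small sketch, assuming 2's complement operands and that the comparison a <= b is derived from the flags of the subtraction a - b. The function names and the default 10-bit width (suggested by the "10" labels in the figure) are assumptions for illustration.

```python
WIDTH = 10  # assumed operand width, suggested by the "10" labels in the figure

def subtract_flags(a, b, width=WIDTH):
    """Compute a - b as a + (2's complement of b); return (zero, sign, ovfl)."""
    mask = (1 << width) - 1
    sign_bit = 1 << (width - 1)
    a_u, b_u = a & mask, b & mask
    diff = (a_u + ((~b_u) & mask) + 1) & mask
    zero = diff == 0
    sign = bool(diff & sign_bit)
    # Signed overflow on subtraction: operand signs differ and the result sign differs from a's.
    ovfl = bool((a_u ^ b_u) & (a_u ^ diff) & sign_bit)
    return zero, sign, ovfl

def less_or_equal(a, b, width=WIDTH):
    """a <= b  <=>  zero OR XOR(sign, ovfl)."""
    zero, sign, ovfl = subtract_flags(a, b, width)
    return zero or (sign != ovfl)
```

When no overflow occurs, XOR(sign, ovfl) and AND(sign, not(ovfl)) agree; they differ only when the subtraction overflows, e.g. for a = -512 and b = 511 at 10 bits.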

Brief Intro to Logic Optimization & Technology Mapping

Extracted from notes by Srinivas Devadas, MIT

Two-Level Logic Minimization

Can realize an arbitrary logic function in sum-of-products or two-level form

F1 = A B + A B D + A B C D+ A B C D + A B + A B D

F1 = B + D + A C + A C

Of great interest to find a minimum sum-of-products representation

– Solved problem even for functions with 100’s of inputs (variants of Quine-McCluskey)


Two-Level versus Multilevel

2-Level:

f1 = AB + AC + AD
f2 = AB + AC + AE

6 product terms which cannot be shared. 24 transistors in static CMOS.

Multi-level:

Note that B + C is a common term in f1 and f2.

K = B + C
f1 = AK + AD
f2 = AK + AE

3 levels. 20 transistors in static CMOS, not counting inverters.
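As a quick sanity check, the following sketch compares the two-level and multi-level forms over all input combinations. It uses the equations exactly as they appear in this transcript (any complement bars lost in extraction are not modeled), and the function names are illustrative.

```python
from itertools import product

def two_level(A, B, C, D, E):
    f1 = (A and B) or (A and C) or (A and D)
    f2 = (A and B) or (A and C) or (A and E)
    return bool(f1), bool(f2)

def multi_level(A, B, C, D, E):
    K = B or C                       # shared subexpression K = B + C
    f1 = (A and K) or (A and D)
    f2 = (A and K) or (A and E)
    return bool(f1), bool(f2)

# Exhaustive truth-table comparison over the 2^5 input assignments.
equivalent = all(two_level(*v) == multi_level(*v)
                 for v in product([False, True], repeat=5))
```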


Technologies

“Closed book”: gate-array, standard-cell

“Open book”: CMOS Domino, complex gate static CMOS

LOGIC EQUATIONS → TECHNOLOGY-INDEPENDENT OPTIMIZATION (Factoring, Commonality Extraction) → TECH-DEPENDENT OPTIMIZATION (MAPPING, TIMING), using the LIBRARY → OPTIMIZED LOGIC NETWORK

Tech.-Independent Optimization

Involves:
• Minimizing two-level logic functions.
• Finding common subexpressions.
• Substituting one expression into another.
• Factoring single functions.

Factored versus Disjunctive forms

Sum-of-products or disjunctive form:

f = ac + ad + bc + bd + ae

Factored form (multi-level or complex gate):

f = (a + b)(c + d) + ae
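The benefit of factoring can be made concrete by counting literals. The sketch below checks that the disjunctive and factored forms of f agree on every input and counts variable occurrences in each; `literal_count` is a hypothetical helper that assumes single-letter variable names, and the expressions are taken as they appear in the transcript.

```python
from itertools import product

def f_sop(a, b, c, d, e):
    return (a and c) or (a and d) or (b and c) or (b and d) or (a and e)

def f_factored(a, b, c, d, e):
    return ((a or b) and (c or d)) or (a and e)

def literal_count(expr):
    """Count variable occurrences in a textual expression (single-letter variables)."""
    return sum(ch.isalpha() for ch in expr)

same = all(bool(f_sop(*v)) == bool(f_factored(*v))
           for v in product([False, True], repeat=5))
sop_literals = literal_count("ac + ad + bc + bd + ae")
factored_literals = literal_count("(a + b)(c + d) + ae")
```

Literal count is only a rough proxy for area; it is used here because it can be computed directly from the expressions.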


Optimizations

(X' denotes the complement of X.)

Factor F:

F:  f1 = AB + AC + AD + AE + A'B'C'D'E'
    f2 = AB + AC + AD + AF + A'B'C'D'F'

⇒

F:  f1 = A(B + C + D + E) + A'B'C'D'E'
    f2 = A(B + C + D + F) + A'B'C'D'F'

Extract common expression:

G:  g1 = B + C + D
    f1 = A(g1 + E) + A'E'g1'
    f2 = A(g1 + F) + A'F'g1'

What Does “Best” Mean?

Transistor count ⇒ AREA
Number of circuits ⇒ POWER
Number of levels ⇒ DELAY (speed)

Need quick estimators of area, delay and power which are also accurate

Tech.-Dependent Optimization

Area, delay and power dissipation cost functions

OPTIMIZED LOGIC EQUATIONS → TECHNOLOGY MAPPING (inputs: LIBRARY, TIMING CONSTRAINTS) → GATE NETLIST

“Closed Book” Technologies

A standard cell technology or library is typically restricted to a few tens of gates, e.g., MSU library: 31 cells.

Gates may be NAND, NOR, NOT, AOIs.

[Figure: example gate symbols with inputs A, B, C, including a gate labeled AB+C.]

Mapping via DAG Covering

Represent network in canonical form ⇒ subject DAG

Represent each library gate with canonical forms for the logic function ⇒ primitive DAGs

Each primitive DAG has a cost

Goal: Find a minimum cost covering of the subject DAG by the primitive DAGs

Canonical form: 2-input NAND gates and inverters

Sample Library

Gate        Cost
INVERTER    2
NAND2       3
NAND3       4
NAND4       5

Sample Library - 2

Gate        Cost
AOI21       4
AOI22       5

Standard cell library

• For each cell (e.g., NANDs, NORs, Invs, AOIs)
  – Functional information
  – Timing information
    • Output slew
    • Intrinsic delay
    • Input capacitance
  – Physical footprint
  – Power characteristics (leakage power)

Trivial Covering

Reduce netlist into ND2 gates → subject DAG

Cells       # pins        # transistors
7 NAND2     21            28
5 INV       10            10
Total       31            38
            (pin cost)    (area cost)

Covering #1

Cells       # pins        # transistors
2 INV       4             4
2 NAND2     6             8
1 NAND3     4             6
1 NAND4     5             8
Total       19            26
            (pin cost)    (area cost)

Covering #2

Cells       # pins        # transistors
1 INV       2             2
1 NAND2     3             4
2 NAND3     8             12
1 AOI21     4             ?
Total       17            ?
            (pin cost)    (area cost)
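The totals for the three coverings can be recomputed from per-cell costs. This is a minimal sketch: the `CELLS` table and `cover_cost` are illustrative names, the per-cell numbers are those implied by the covering tallies, and the AOI21 transistor count is left open because the slide leaves it as "?".

```python
# Per-cell pin and transistor counts as implied by the covering tallies.
CELLS = {
    "INV":   {"pins": 2, "transistors": 2},
    "NAND2": {"pins": 3, "transistors": 4},
    "NAND3": {"pins": 4, "transistors": 6},
    "NAND4": {"pins": 5, "transistors": 8},
    "AOI21": {"pins": 4, "transistors": None},   # left open ("?") in the slides
}

def cover_cost(cover):
    """cover maps cell name -> number of instances; returns (pin cost, area cost).
    The area cost is None if any cell in the cover has an unknown transistor count."""
    pins = sum(n * CELLS[c]["pins"] for c, n in cover.items())
    if any(CELLS[c]["transistors"] is None for c in cover):
        return pins, None
    area = sum(n * CELLS[c]["transistors"] for c, n in cover.items())
    return pins, area

trivial = cover_cost({"NAND2": 7, "INV": 5})
covering1 = cover_cost({"INV": 2, "NAND2": 2, "NAND3": 1, "NAND4": 1})
covering2 = cover_cost({"INV": 1, "NAND2": 1, "NAND3": 2, "AOI21": 1})
```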

DAG Covering

• Sound algorithmic approach
• NP-hard optimization problem
• Tree covering heuristic: if subject and primitive DAGs are trees, an efficient algorithm can find the optimum cover in linear time ⇒ dynamic programming formulation

[Figure: subject DAG with multiple-fanout points marked.]

Partitioning a Graph

• Partition input netlist into a forest of trees
• Solve each tree optimally
• Stitch trees back together
• Any issues to take care of when doing TM for each individual tree so that the stitching back can be done (this is really a bookkeeping issue rather than intrinsically algorithmic)?

Resulting Trees

Break at multiple fanout points
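Breaking at multiple-fanout points can be sketched as follows, assuming the netlist is given as a dictionary from each gate to its list of fanins (names that are not keys are primary inputs). The function names and the example netlist are hypothetical.

```python
from collections import defaultdict

def tree_roots(fanins, primary_outputs):
    """Gate outputs at which the DAG is broken: primary outputs plus every
    gate whose output feeds more than one gate input."""
    fanout = defaultdict(int)
    for ins in fanins.values():
        for src in ins:
            fanout[src] += 1
    multi = {g for g in fanins if fanout[g] > 1}
    return sorted(multi | set(primary_outputs))

def extract_tree(root, fanins, roots):
    """Gates of the tree rooted at `root`; stops at other roots and primary inputs."""
    gates, stack = [], [root]
    while stack:
        g = stack.pop()
        gates.append(g)
        for src in fanins[g]:
            if src in fanins and src not in roots:
                stack.append(src)
    return sorted(gates)

# Hypothetical netlist: n1 fans out to both n2 and n3.
NETLIST = {
    "n1": ["a", "b"],
    "n2": ["n1", "c"],
    "n3": ["n1", "d"],
    "out": ["n2", "n3"],
}
roots = tree_roots(NETLIST, ["out"])
forest = {r: extract_tree(r, NETLIST, set(roots)) for r in roots}
```

In this sketch, the root of one tree (here n1) appears as a leaf input of the trees it feeds, which is one form the bookkeeping for stitching the trees back together can take.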


Dynamic Programming

Principle of optimality: Optimal cover for a tree consists of a match at the root of the tree plus the optimal cover for the sub-trees starting at each input of the match

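[Figure: two alternative matches at the root of a tree, one with inputs x, y, z and one with inputs p, z.]

• Best cover for this match uses best covers for x, y, z
• Best cover for this match uses best covers for p, z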

Optimum Tree Covering

[Figure: subject tree annotated with the best match and cumulative cost at each node:]

NAND2: 3
AOI21: 4 + 3 = 7
INV: 11 + 2 = 13
NAND2: 2 + 6 + 3 = 11
NAND2: 3 + 3 = 6
NAND2: 3
INV: 2

Is it a good idea to use the largest possible “cover” or “cut” going top-down or bottom-up (this may not be visible in this small example)?

DP Algorithm for Tech. Mapping of Tree Circuits

Procedure DP_TM(p: gate output, G: circuit); /* G is a tree circuit */

  if DP_TM(p, G) is “stored” then return(its cost, corresp. solution);

  Going backward from p to its fanins, produce all possible “cuts” (that separate the sub-ckt w/ p as o/p from the rest of the circuit) so that the # of wires cut <= max # of i/ps available for cells/gates in the library;

  cost = infinity;

  For each such cut Ci do

    if the subckt containing p is mappable to some gate gi in the library AND does not include any “forbidden” interconnects then begin

      cost(Ci) = cost(gi) + Σ over qj in Ci of (DP_TM(qj, G)[1]); /* if DP_TM(qj, G) has already been investigated, it will be “stored”, and there is no need to execute this procedure again */

      if cost(Ci) < cost then { cost = cost(Ci); solution = corresponding soln. }

    end if;

  End for;

  store (cost, solution) as the result of DP_TM(p, G);

  return(cost, solution);

end DP_TM;
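A runnable sketch of the same idea follows. Instead of enumerating cuts explicitly, it matches library patterns at each node, which is one common way to realize the procedure; that substitution, the tuple encoding of trees, and the particular NAND2/INV decompositions of NAND3 and AOI21 are assumptions of this sketch. Costs are those of the sample library, and memoization via `lru_cache` plays the role of the “stored” check.

```python
from functools import lru_cache

# Trees in NAND2/INV canonical form:
#   "x"               leaf (primary input in a subject tree, variable in a pattern)
#   ("INV", t)        inverter
#   ("NAND", t1, t2)  2-input NAND
LIBRARY = {  # name: (pattern, cost)
    "INV":   (("INV", "x"), 2),
    "NAND2": (("NAND", "x", "y"), 3),
    "NAND3": (("NAND", ("INV", ("NAND", "x", "y")), "z"), 4),
    "AOI21": (("INV", ("NAND", ("NAND", "x", "y"), ("INV", "z"))), 4),
}

def matches(pat, node):
    """Yield, for each way pat matches at node, the subject subtrees bound to pat's leaves."""
    if isinstance(pat, str):            # a pattern leaf matches any subtree
        yield [node]
        return
    if isinstance(node, str) or pat[0] != node[0]:
        return
    if pat[0] == "INV":
        yield from matches(pat[1], node[1])
        return
    # NAND is commutative: try both input orders.
    for left, right in ((node[1], node[2]), (node[2], node[1])):
        for ml in matches(pat[1], left):
            for mr in matches(pat[2], right):
                yield ml + mr

@lru_cache(maxsize=None)                # memoization = the "stored" results
def best_cover(node):
    """Return (min cost, tuple of gate names) for the tree rooted at node."""
    if isinstance(node, str):
        return 0, ()
    best = (float("inf"), ())
    for name, (pat, cost) in LIBRARY.items():
        for leaves in matches(pat, node):
            total, gates = cost, (name,)
            for leaf in leaves:
                sub_cost, sub_gates = best_cover(leaf)
                total += sub_cost
                gates += sub_gates
            if total < best[0]:
                best = (total, gates)
    return best

aoi_tree = ("INV", ("NAND", ("NAND", "a", "b"), ("INV", "c")))
aoi_result = best_cover(aoi_tree)
```

For `aoi_tree`, a single AOI21 (cost 4) beats the trivial cover by two NAND2s and two inverters (cost 10).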


Example Cuts in DP_TM

[Figure: example cuts C1, C2, C3, C4.]

Recursive Calls for DP_TM

[Figure: recursive calls on C1’, C2’, C3’, C4’.]

Ci’ is the set of gate o/ps that are in the “complement” (remainder) subckt. of G generated by cut Ci

DP Algorithm for Tech. Mapping of Tree Circuits

Analysis:

• Optimality?

• Runtime analysis:
  – # of subsets (cuts) that can be generated from a gate g's o/p q that include g and include fan-ins only up to a fan-in size of m (= max. number of i/ps among all cells in the library): if k is the # of all gate i/ps that can be enclosed by a cut across all valid cuts w/ o/p at q, then # of valid cuts = O(2^k), and k itself is O(2^m), and thus # of valid cuts is O(2^(2^m)). However, since m is a constant like 6, the # of valid cuts per gate o/p turns out to be a medium-size constant in the range 20-50.

– # of 1st time calls of DP_TM for each gate o/p

– Total # of DP_TM calls

• What if the circuit is not a tree: how to process, optimal or non-optimal?

• How to obtain optimality for a general DAG and at the cost of how much increased runtime?


What about delay optimization?

[Figure: root cut C1 fed by cuts C2, C3 and C4. Recoverable labels: Cin(g(C1)); Rout(g(C2)), Rout(g(C3)), Rout(g(C4)); intrinsic delays d(g(C1)), d(g(C2)), d(g(C4)); delay till root cut C2 i/p: D1; D2; D3.]

• Each cell g has an o/p resistance Rout(g), input capacitance Cin(g) and intrinsic delay d(g).


• Delay at the C1 input = max[D1 + Rout(g(C2))*Cin(g(C1)), D2 + Rout(g(C3))*Cin(g(C1)), D3 + Rout(g(C4))*Cin(g(C1))]
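The max expression above can be evaluated with a short sketch: each fan-in cut contributes its delay D_i plus the product of its cell's output resistance and the root cell's input capacitance. The function name and the numeric values are hypothetical and use arbitrary consistent units.

```python
def delay_at_root_input(fanin_cuts, cin_root):
    """fanin_cuts: list of (D_i, Rout_i) pairs, one per cut feeding the root cell.
    Returns max over i of D_i + Rout_i * Cin(root cell)."""
    return max(d + rout * cin_root for d, rout in fanin_cuts)

# Hypothetical values for the three fan-in cuts (C2, C3, C4) and for Cin(g(C1)).
delay = delay_at_root_input([(5.0, 2.0), (4.0, 3.0), (6.0, 1.0)], cin_root=1.5)
```

With these numbers the second cut dominates (4.0 + 3.0 * 1.5), even though it has the smallest D_i, which illustrates why the cell chosen for a sub-tree affects delay at its parent.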

Extension to DAGs: What is the problem?

[Fig. 1 and Fig. 2: DAG with a multiple-fanout node u and cuts Ci, Cj, Ck and Cr.]

• 3 different calls, one for each fanout, to TM of DAG rooted at u (TM(u)).

• Will the solution for TM(u) that leads to the best overall solution for the entire problem be affected by which cut/cover TM(u) is called for when the optimization metric is:

a) Area or pin cost? b) Delay?

TM(u)

Appendix (not necessary to consider): Tackling fanouts in DAGs

Black arcs represent parent-child relationships among cuts in a D&C context.

The only way a sibling cut can change the subcircuit rooted at u is if it encloses a “predecessor” of u, as the cut Cr does in Fig. 2. But such a cut can only be valid if it cuts only one gate o/p, which has to be the one for its root node, which is not the case here (Cr is invalid as it cuts two node o/ps). Such a “changing” cut is valid only if its root node is also a predecessor of u, as in Fig. 3. In such cases we will need to do as many TM calls on the subcircuit subckt(u) rooted at u as there are changes made to it by other cuts, and this can potentially be exponential over all such combinations of TMs of nodes similar to u. Practically there may be very few such “changing” cuts of subckt(u), but even if there are 2, we need to consider the combinations of all such changed subckt(u)’s for all u’s w/ fanouts, leading to an exponential number of combinations if the number of nodes w/ fanouts is Θ(n), where n is the total # of nodes. If the number of such nodes is a constant, then we get a linear-complexity TM algorithm as in the tree case.

[Fig. 1, Fig. 2 and Fig. 3: node u with cuts Ci, Cj, Ck and Cr. Fig. 3 shows two cuts Cr and Cr’ that “change” subckt(u).]