ece465 high level design strategies

Upload: muhammad-yusuf

Post on 06-Apr-2018


  • 8/3/2019 Ece465 High Level Design Strategies

    1/23

    ECE 465

    High Level Design Strategies

    Lecture Notes # 9

    Shantanu Dutt

    Electrical & Computer Engineering

    University of Illinois at Chicago


    Outline

    Circuit Design Problem

    Solution Approaches:
      Truth Table (TT) vs. Computational/Algorithmic
        Yes, hardware, just like software, can implement any algorithm!
      Flat vs. Divide-&-Conquer

    Divide-&-Conquer:
      Associative operations/functions
      General operations/functions

    Other Design Strategies for fast circuits:
      Speculative computation
      Best of both worlds (best average and best worst-case)
      Pipelining

    Summary


    Circuit Design Problem

    Design an 8-bit comparator that compares two 8-bit #s available in two registers A[7..0] and B[7..0], and that o/ps F = 1 if A > B and F = 0 if A <= B


    Circuit Design Problem (contd)

    Approach 2: Think computationally/algorithmically about what the ckt is supposed to compute:

    Approach 2(a): Flat algorithmic approach:

    Note: A TT can be expressed as a sequence of if-then-elses

    If A = 00000000 and B = 00000000 then F = 0

    else if A = 00000000 and B = 00000001 then F=0

    .

    else if A = 00000001 and B = 00000000 then F=1

    .

    Essentially a re-hashing of the TT: same problems as the TT approach
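To get a feel for why the flat approach scales poorly, the sketch below counts the rows such a truth table needs and builds the table for a small width. It is a Python illustration only; `truth_table_rows` and `build_flat_table` are hypothetical helper names, not part of the lecture, and table size is assumed to be one of the "problems" the slide alludes to.

```python
def truth_table_rows(num_input_bits: int) -> int:
    """Number of rows in a truth table with the given number of input bits."""
    return 2 ** num_input_bits


def build_flat_table(width: int) -> dict:
    """Enumerate every (A, B) pair of two width-bit numbers, as the flat
    if-then-else chain would, and record F = 1 if A > B else 0."""
    table = {}
    for a in range(2 ** width):
        for b in range(2 ** width):
            table[(a, b)] = 1 if a > b else 0
    return table


# An 8-bit comparator has 16 input bits (8 for A, 8 for B).
rows_8bit = truth_table_rows(16)
# A 2-bit comparator is still small enough to enumerate by hand.
small_table = build_flat_table(2)
```

The row count doubles with every extra input bit, so the 8-bit table is far too large to write out or minimize by hand.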


    Circuit Design Problem: Strategy 1: Divide-&-Conquer

    Approach 2(b): Structured algorithmic approach:

    Be more innovative; think of the structure/properties of the computational problem. E.g., think if the problem can be solved in a hierarchical or divide-&-conquer (D&C) manner:

    [Figure: D&C tree. Root problem A splits into Subprob. A1 and Subprob. A2, which split into A1,1, A1,2 and A2,1, A2,2. The solns to A1 and A2 are stitched up to form the complete soln to A. Do recursively until subprob-size is s.t. TT-based design is doable.]

    D&C approach: See if the problem can be broken up into 2 or more smaller subproblems whose solutions can be stitched up to give a soln. to the parent prob.

    Do this recursively for each large subprob until subprobs are small enough for TT-based solutions.

    If the subprobs are of a similar kind (but of smaller size) to the root prob, then the breakup and stitching will also be similar.


    Shift Gears: Design of a Parity Detection Circuit (A Series of XORs)

    [Figure (a): A linearly-connected circuit. x(0) and x(1) feed the first XOR, and each later input x(2), x(3), ..., x(15) is XORed with the running result to produce f.]

    [Figure (b): 16-bit parity tree. Inputs x(15), x(14), ..., x(1), x(0) feed eight XORs with o/ps w(3,0) .. w(3,7); these feed w(2,0) .. w(2,3), then w(1,0), w(1,1), and finally w(0,0) = f.]

    Delay = (# of levels in the tree) * td = log2(n) * td

    An example of simple designer ingenuity: a bad design would have resulted in a linear delay that the VHDL code & the synthesis tool would have been at the mercy of.

    No concurrency in design (a): the actual problem has available concurrency, though, and it is not exploited well in the above linear design.

    Complete sequentialization leading to a delay that is linear in the # of bits n (delay = n*td), td = delay of 1 gate.

    All the available concurrency is exploited in design (b), a parity tree.

    Question: When can we have a tree-structured circuit for an operation on multiple operands?

    Answer: (1) When the operation makes sense for any # of operands. (2) It should be possible to break it down into operations w/ fewer operands. (3) When the operation is associative. An oper. x is said to be associative if:

    a x b x c = (a x b) x c = a x (b x c).

    Thus if we have 4 operands a x b x c x d, we can either perform this as a x (b x (c x d)) [getting a linear delay of 3 units] or as (a x b) x (c x d) [getting a logarithmic (base 2) delay of 2 units and exploiting the available concurrency due to the fact that x is associative].

    We can extend this idea to n operands (& n-1 operations) to perform as many of the pairwise operations as possible in parallel (& do this recursively for every level of remaining operations), similar to design (b) for the parity detector [xor is an associative operation!] and thus get a (log2 n) delay.

    f = (((x(15) xor x(14)) xor (x(13) xor x(12))) xor ((x(11) xor x(10)) xor (x(9) xor x(8)))) xor (((x(7) xor x(6)) xor (x(5) xor x(4))) xor ((x(3) xor x(2)) xor (x(1) xor x(0))))
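The difference between the linear chain and the parity tree can be sketched in software by counting gate levels on the longest path. This is a Python model, not a hardware description; `parity_linear` and `parity_tree` are hypothetical names, and each XOR is assumed to cost one delay unit td. Note that an n-input chain uses n-1 XORs, which the slide rounds to n.

```python
def parity_linear(bits):
    """Chain of XORs: returns (parity, number of gates on the longest path)."""
    result, depth = bits[0], 0
    for b in bits[1:]:
        result ^= b      # one more XOR in series
        depth += 1
    return result, depth


def parity_tree(bits):
    """Balanced XOR tree: returns (parity, number of tree levels)."""
    level, depth = list(bits), 0
    while len(level) > 1:
        # All pairwise XORs of one level work concurrently.
        nxt = [level[i] ^ level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:           # odd operand passes through unchanged
            nxt.append(level[-1])
        level, depth = nxt, depth + 1
    return level[0], depth


x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1]  # 16 input bits
linear_parity, linear_depth = parity_linear(x)
tree_parity, tree_depth = parity_tree(x)
```

Both functions compute the same value because XOR is associative; only the grouping, and hence the depth, differs (15 levels versus 4 for 16 inputs).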


    D&C for Associative Operations

    Let f(x_{n-1}, ..., x_0) be an associative function.

    What is the D&C principle involved in the design of an n-bit xor/parity function? Can it also lead automatically to a tree-based ckt?

    [Figure: f(x_{n-1}, ..., x_0) is split into the subproblems a = f(x_{n-1}, ..., x_{n/2}) and b = f(x_{n/2-1}, ..., x_0); the stitch-up function f(a,b) is the same as the original function for 2 inputs.]

    Using the D&C approach for an associative operation results in the stitch-up function being the same as the original function (not the case for non-assoc. operations), but w/ a constant # of operands (2, if the orig problem is broken into 2 subproblems).

    If the two sub-problems of the D&C approach are balanced (of the same size or as close to it as possible), then unfolding the D&C results in a balanced operation tree of the type for the xor/parity function seen earlier.
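A minimal Python sketch of this principle follows: the problem is split into two balanced halves, and the stitch-up is the 2-operand version of the same function. `dc_reduce` is a hypothetical name introduced here for illustration. The last lines show that the same regrouping is not safe for a non-associative operation such as subtraction.

```python
from operator import xor, sub


def dc_reduce(f, xs):
    """Apply the 2-operand function f to xs by balanced divide-and-conquer."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    # Stitch-up = f itself, applied to the results of the two halves.
    return f(dc_reduce(f, xs[:mid]), dc_reduce(f, xs[mid:]))


def linear_reduce(f, xs):
    """Left-to-right chain, as in a linearly-connected circuit."""
    acc = xs[0]
    for x in xs[1:]:
        acc = f(acc, x)
    return acc


bits = [1, 0, 1, 1, 0, 1, 1, 1]
tree_xor = dc_reduce(xor, bits)        # same as the chain: xor is associative
chain_xor = linear_reduce(xor, bits)

nums = [8, 4, 2, 1]
tree_sub = dc_reduce(sub, nums)        # (8 - 4) - (2 - 1)
chain_sub = linear_reduce(sub, nums)   # ((8 - 4) - 2) - 1
```

For xor the two groupings agree; for subtraction they do not, which is why a non-associative operation needs its own stitch-up function.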


    D&C Approach for Non-Associative Opers: n-bit Comparator

    [Figure: D&C tree for the comparator. Root A: Comp. A[7..0], B[7..0]. It splits into A1: Comp A[7..4], B[7..4] and A2: Comp A[3..0], B[3..0]. A1 splits into A1,1: Comp A[7..6], B[7..6] and A1,2: Comp A[5,4], B[5,4]. A1,1 splits into A1,1,1: Comp A[7], B[7] and A1,1,2: Comp A[6], B[6], which are small enough to be designed using a TT (2-bit 2-o/p comparator).]

    Stitch-up of solns to A1 and A2 to form the complete soln to A: If A1 reslt is > or < take A1 reslt, else take A2 reslt.
    Likewise, if A1,1 reslt is > or < take A1,1 reslt, else take A1,2 reslt; if A1,1,1 reslt is > or < take A1,1,1 reslt, else take A1,1,2 reslt.

    Useful property: At any level, comp. of MS (most significant) half determines o/p if result is > or <, else comp. of LS determ. o/p.

    Can thus break up problem at any level into MS and LS comparisons & based on their results determine which o/p to choose for the higher-level (parent) result.

    TT for the leaf comparator:

    A[i]  B[i]  f1(i)  f2(i)
     0     0     0      1
     0     1     0      0
     1     0     1      0
     1     1     0      1

    If A[i] = B[i] then { f1(i) = 0; f2(i) = 1 }
        /* f2(i) o/p is an i/p to the stitch logic */
        /* f2(i) = 1 means f1( ), f2( ) o/ps of the LS of this subtree should be selected by the stitch logic as its o/ps */
    else if A[i] < B[i] then { f1(i) = 0;    /* indicates < */
                               f2(i) = 0 }   /* indicates f1(i), f2(i) o/ps should be selected by stitch logic as its o/ps */
    else if A[i] > B[i] then { f1(i) = 1;    /* indicates > */
                               f2(i) = 0 }

    The TT may be derived directly or by first thinking of and expressing its computation in a high-level programming language and then converting it to a TT.

    Is this associative? Not sure. For a non-associative func, determine its properties that allow determining a break-up & a correct stitch-up function.


    Comparator Circuit Design Using D&C (contd.)

    [Figure: the comparator D&C tree of the previous slide, repeated.]

    Once the D&C tree is formulated, it is easy to get the low-level & stitch-up designs. Stitch-up design shown here.

    Leaf TT:

    A[i]  B[i]  f1(i)  f2(i)
     0     0     0      1
     0     1     0      0
     1     0     1      0
     1     1     0      1

    Stitch-up logic details:
    If f2(i) = 0 then { my_op1 = f1(i); my_op2 = f2(i) }       /* select MS comp. o/ps */
    else { my_op1 = f1(i-1); my_op2 = f2(i-1) }                /* select LS comp. o/ps */

    [Figure: Stitch-up logic block with i/ps f1(i), f2(i), f1(i-1), f2(i-1) and o/ps my_op1, my_op2.]

    (Compact TT)

    f1(i)  f2(i)  f1(i-1)  f2(i-1)  |  my_op1   my_op2
      X      0       X        X     |  f1(i)    f2(i)
      X      1       X        X     |  f1(i-1)  f2(i-1)

    OR

    (Direct design)

    [Figure: 2-bit 2:1 Mux with 2-bit i/ps I0 = f(i) and I1 = f(i-1), select i/p f2(i), and 2-bit o/p my_op.]
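The leaf TT and the stitch-up logic can be put together as a short recursive model. This is a Python sketch of the D&C comparator under the slide's encoding (f1 = 1 means >, f2 = 1 means equal); the function names and the MSB-first bit lists are assumptions made for the illustration.

```python
def leaf_compare(a_bit, b_bit):
    """1-bit comparator TT: returns (f1, f2)."""
    f1 = 1 if a_bit > b_bit else 0    # f1 = 1 indicates A[i] > B[i]
    f2 = 1 if a_bit == b_bit else 0   # f2 = 1 indicates A[i] = B[i]
    return f1, f2


def stitch(ms, ls):
    """2-bit 2:1 mux: keep the MS result unless it reports equality."""
    return ms if ms[1] == 0 else ls


def compare(a_bits, b_bits):
    """D&C comparison of two equal-length bit lists, MSB first."""
    if len(a_bits) == 1:
        return leaf_compare(a_bits[0], b_bits[0])
    mid = len(a_bits) // 2
    ms = compare(a_bits[:mid], b_bits[:mid])   # most significant half
    ls = compare(a_bits[mid:], b_bits[mid:])   # least significant half
    return stitch(ms, ls)


def to_bits(value, width=8):
    """Unsigned integer to a list of bits, MSB first."""
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


def greater_than(a, b, width=8):
    """F = 1 if A > B, else 0, using the D&C comparator."""
    f1, _ = compare(to_bits(a, width), to_bits(b, width))
    return f1
```

For example, `greater_than(200, 100)` evaluates to 1 and `greater_than(77, 77)` to 0. In hardware the two recursive calls are independent blocks that work at the same time, which is where the logarithmic delay comes from.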


    Comparator Circuit Design Using D&C: Final Design

    [Figure: final 8-bit comparator.
     Level 0: eight 1-bit comparators on (A[7], B[7]) .. (A[0], B[0]), each with a 2-bit o/p f(7) .. f(0).
     Level 1: four 2-bit 2:1 Muxes. my(3) muxes f(7) and f(6), select f2(7) = f(7)(2); my(2) muxes f(5) and f(4), select f(5)(2); my(1) muxes f(3) and f(2), select f(3)(2); my(0) muxes f(1) and f(0), select f(1)(2).
     Level 2: two 2-bit 2:1 Muxes. my(5) muxes my(3) and my(2), select my(3)(2); my(4) muxes my(1) and my(0), select my(1)(2).
     Level 3: one 1-bit 2:1 Mux with i/ps my(5)(1) and my(4)(1), select my(5)(2), and o/p F = my1(6).
     Log n levels of Muxes.]

    Delay(8-bit comp.) = 3 (delay of 2:1 Mux) + delay of 2-bit comp.

    Note parallelism at work: multiple logic blocks are processing simult.

    Delay(n-bit comp.) = log n (delay of 2:1 Mux) + delay of 2-bit comp.

    H/W_cost(8-bit comp.) = 7 (H/W_cost(2:1 Mux)) + 8 (H/W_cost(2-bit comp.))

    H/W_cost(n-bit comp.) = (n-1) (H/W_cost(2:1 Mux)) + n (H/W_cost(2-bit comp.))
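The delay and cost formulas can be evaluated for any power-of-two width with a few lines of Python. This is only a calculator for the expressions above; `comparator_metrics` is a hypothetical name, and the unit delays and costs passed in are placeholder values, not figures from the lecture.

```python
def comparator_metrics(n, t_mux=1.0, t_leaf=1.0, c_mux=1.0, c_leaf=1.0):
    """Return (mux levels, delay, hardware cost) of the D&C n-bit comparator.

    n must be a power of 2. t_* are block delays and c_* are block costs,
    in whatever units the caller chooses.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of 2")
    levels = n.bit_length() - 1                # log2(n) levels of muxes
    delay = levels * t_mux + t_leaf            # log n * t(Mux) + t(leaf comp.)
    cost = (n - 1) * c_mux + n * c_leaf        # (n-1) muxes + n leaf comps
    return levels, delay, cost


levels_8, delay_8, cost_8 = comparator_metrics(8)
```

With unit delays and costs, the 8-bit case gives 3 mux levels, a delay of 4 units, and a cost of 15 units (7 muxes plus 8 leaf comparators).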


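    D&C: Top-Down vs Bottom-Up: Mux Design

    [Figure (a) Top-Down: a 2^n:1 MUX built from two 2^(n-1):1 MUXes, both selected by S_{n-2} .. S_0; one takes i/ps I_0 .. I_{2^(n-1) - 1} and the other takes I_{2^(n-1)} .. I_{2^n - 1}. Their o/ps feed a 2:1 MUX selected by S_{n-1}.]

    First group of i/ps: All bits except msb should have different combinations; msb should be at a constant value (here 0).

    Second group of i/ps: All bits except msb should have different combinations; msb should be at a constant value (here 1).

    MSB value should differ among these 2 groups.

    [Figure (b) Bottom-Up: 2^(n-1) 2:1 MUXes, each selected by S_0, feed a 2^(n-1):1 MUX selected by S_{n-1} .. S_1.]

    Generally better to try top-down first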


    An 8:1 MUX example (bottom-up)

    [Figure: an 8:1 MUX with i/ps I0 .. I7, select i/ps S2 S1 S0 and o/p Z, built bottom-up. Four 2:1 MUXes, each selected by S0, take the i/p pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7); their o/ps feed a 4:1 MUX selected by S2 S1, which produces Z. I0, I2, I4, I6 are selected when S0 = 0; I1, I3, I5, I7 are selected when S0 = 1.]

    The two inputs of a 2:1 Mux at this level should have different lsb or S0 values, since their sel. is based on S0 (all other remaining, i.e., unselected bit values should be the same). Similarly for other i/p pairs at 2:1 Muxes at this level.


    Opening up the 8:1 MUX's hierarchical design and a top-down view

    [Figure: the 8:1 MUX (i/ps I0 .. I7, select S2 S1 S0, o/p Z) opened up into three levels of 2:1 MUXes. Level 1: four 2:1 MUXes selected by S0 on the pairs (I0, I1), (I2, I3), (I4, I5), (I6, I7). Level 2: two 2:1 MUXes selected by S1. Level 3: one 2:1 MUX selected by S2, producing Z. Example path: I0, I2, I4, I6 pass level 1 when S0 = 0; I2 and I6 pass level 2 when S0 = 0, S1 = 1 (these i/ps should differ in S2); I6 is selected when S0 = 0, S1 = 1, S2 = 1.]

    Top-down view: levels 1 and 2 form two 4:1 Muxes whose o/ps feed the final 2:1 MUX.

    First 4:1 Mux: All bits except msb should have different combinations; msb should be at a constant value (here 0).

    Second 4:1 Mux: All bits except msb should have different combinations; msb should be at a constant value (here 1).

    MSB value should differ among these 2 groups.
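Both decompositions of the mux can be modelled recursively. The following Python sketch builds a 2^n:1 mux out of 2:1 muxes in the top-down and the bottom-up style; the function names and the MSB-first ordering of the select list are assumptions made for this illustration.

```python
def mux2(i0, i1, s):
    """2:1 mux: o/p is i0 when s = 0 and i1 when s = 1."""
    return i1 if s else i0


def mux_top_down(inputs, selects):
    """Two half-size muxes on the lower select bits, then a 2:1 mux on the msb.

    inputs: list of 2**n values, I0 first; selects: [S(n-1), ..., S0].
    """
    if len(inputs) == 1:
        return inputs[0]
    half = len(inputs) // 2
    low_group = mux_top_down(inputs[:half], selects[1:])    # msb = 0 group
    high_group = mux_top_down(inputs[half:], selects[1:])   # msb = 1 group
    return mux2(low_group, high_group, selects[0])


def mux_bottom_up(inputs, selects):
    """A first level of 2:1 muxes on S0, then a half-size mux on the rest."""
    if len(inputs) == 1:
        return inputs[0]
    s0 = selects[-1]
    first_level = [mux2(inputs[i], inputs[i + 1], s0)
                   for i in range(0, len(inputs), 2)]
    return mux_bottom_up(first_level, selects[:-1])


labels = ["I%d" % k for k in range(8)]
picked = mux_top_down(labels, [1, 1, 0])   # S2 = 1, S1 = 1, S0 = 0
```

With S2 S1 S0 = 110 the selected input is I6, matching the example path in the figure; both functions return the same input for every select value.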


    Adder Design using D&C

    Example: Ripple-Carry Adder (RCA)
      Stitching up: Carry from LS n/2 bits is input to carry-in of MS n/2 bits at each level of the D&C tree.
      Leaf subproblem: Full Adder (FA)

    [Figure (a) D&C for Ripple-Carry Adder: "Add n-bit #s X, Y" splits into "Add MS n/2 bits of X,Y" and "Add LS n/2 bits of X,Y", down to leaf FAs.]

    Example: Carry-Lookahead Adder (CLA)
      Division: 4 subproblems per level
      Stitching up: A more complex stitching up process (generation of super P,Gs to connect up the subproblems)
      Leaf subproblem: 4-bit basic CLA with small p, g bits.

    [Figure (b) D&C for Carry-Lookahead Adder: "Add n-bit #s X, Y" splits into "Add ms n/4 bits", "Add 3rd n/4 bits", "Add 2nd n/4 bits" and "Add ls n/4 bits", down to leaf 4-bit CLAs.]

    More intricate techniques (like P,G generation in CLA) for complex stitching up for fast designs may need to be devised that are not directly suggested by D&C. But D&C is a good starting point.
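The ripple-carry breakup can be written as a recursive Python model in which the only stitch-up is passing the LS half's carry-out to the MS half's carry-in. The names and the LSB-first bit lists are assumptions of this sketch; it models behavior, not gate timing.

```python
def full_adder(x, y, cin):
    """Leaf subproblem: returns (sum bit, carry-out)."""
    s = x ^ y ^ cin
    cout = (x & y) | (cin & (x ^ y))
    return s, cout


def rca(x_bits, y_bits, cin=0):
    """D&C ripple-carry add of two equal-length bit lists, LSB first.

    Returns (sum bits LSB first, carry-out).
    """
    n = len(x_bits)
    if n == 1:
        s, c = full_adder(x_bits[0], y_bits[0], cin)
        return [s], c
    mid = n // 2
    ls_sum, ls_carry = rca(x_bits[:mid], y_bits[:mid], cin)
    # Stitch-up: the MS half must wait for the carry of the LS half.
    ms_sum, ms_carry = rca(x_bits[mid:], y_bits[mid:], ls_carry)
    return ls_sum + ms_sum, ms_carry


def to_bits_lsb(value, width):
    """Unsigned integer to a list of bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb(bits):
    """List of bits, LSB first, back to an unsigned integer."""
    return sum(b << i for i, b in enumerate(bits))


sum_bits, carry_out = rca(to_bits_lsb(200, 8), to_bits_lsb(100, 8))
```

Here 200 + 100 = 300 does not fit in 8 bits, so the model returns the low 8 bits (44) together with a carry-out of 1.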


    Dependency Resolution in D&C: (1) The Wait Strategy

    So far we have seen D&C breakups in which there is no data dependency between the two (or more) subproblems of the breakup.

    Data dependency leads to increased delays.

    We now look at various ways of speeding up designs that have subproblem dependencies in their D&C breakups.

    [Figure: Root problem A with Subprob. A1 and Subprob. A2; data flows from A1 to A2.]

    Strategy 1: Wait for required o/p of A1 and then perform A2, e.g., as in a ripple-carry adder: A = n-bit addition, A1 = (n/2)-bit addition of the L.S. n/2 bits, A2 = (n/2)-bit addition of the M.S. n/2 bits.

    No concurrency between A1 and A2:

    t(A) = t(A1) + t(A2) + t(stitch-up)
         = 2*t(A1) + t(stitch-up) if A1 and A2 are the same problems of the same size (w/ different i/ps)
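Unrolling the Wait-strategy relation t(A) = 2*t(A1) + t(stitch-up) down to the leaves shows why the delay is linear in n. The Python sketch below does this numerically; `wait_delay` is a hypothetical name, n is assumed to be a power of 2, and the leaf delay of 10 units is a placeholder value chosen only for the example.

```python
def wait_delay(n_bits, t_leaf, t_stitch=0.0):
    """Delay of an n-bit D&C design in which A2 waits for A1 at every level.

    t(n) = 2 * t(n/2) + t_stitch, with t(1) = t_leaf.
    """
    if n_bits == 1:
        return t_leaf
    half = wait_delay(n_bits // 2, t_leaf, t_stitch)
    return 2 * half + t_stitch


delay_16 = wait_delay(16, t_leaf=10.0)
delay_64 = wait_delay(64, t_leaf=10.0)
```

With no stitch-up delay the result is simply n * t_leaf, so quadrupling the width from 16 to 64 bits quadruples the delay.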


    Example of the Wait Strategy in Adder Design

    Note: Gate delay is proportional to # of inputs (since, generally, there is a series connection of transistors in either the up or down network, and the # of transistors in series = # of inputs; the Rs of the transistors in series add up and are prop. to # of inputs, so delay ~ RC (C is capacitive load) is prop. to # of inputs).

    Assume each gate i/p contributes 2 ns of delay

    For a 16-bit adder the delay will be 160 ns

    For a 64-bit adder the delay will be 640 ns


    Dependency Resolution in D&C: (2) The Design-for-all-cases-&-select or Speculative Strategy

    Strategy 2: For a k-bit i/p from A1 to A2, design 2^k copies of A2, each with a different hardwired k-bit i/p to replace the one from A1.

    Select the correct o/p from all the copies of A2 via a (2^k)-to-1 Mux that is selected by the k-bit o/p from A1 when it becomes available (e.g., carry-select adder).

    t(A) = max(t(A1), t(A2)) + t(Mux) + t(stitch-up)
         = t(A1) + t(Mux) + t(stitch-up) if A1 and A2 are the same problems

    [Figure: Root problem A. Subprob. A1 produces the 2-bit select i/p of a 4-to-1 Mux. Four copies of Subprob. A2, with hardwired i/ps 00, 01, 10 and 11, feed the Mux i/ps 00, 01, 10 and 11.]

    Other variations: Predict Strategy: Have a single copy of A2 but choose a highly likely value of the k-bit i/p and perform A1, A2 concurrently. If, after the k-bit i/p from A1 is available, the selection is incorrect, re-do A2 w/ the correct available value.

    t(A) = p(correct-choice)*max(t(A1), t(A2)) + [1 - p(correct-choice)]*t(A2) + t(Mux) + t(stitch-up),

    where p(correct-choice) is the probability that our choice of the k-bit i/p for A2 is correct.

    Need a completion signal to indicate when the final o/p is available for A; assuming worst-case time (when the choice is incorrect) is meaningless in such designs.
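The carry-select adder named under Strategy 2 is the k = 1 case: two copies of the MS-half adder, one per possible carry-in, and a 2:1 mux driven by the real carry. Below is a behavioral Python sketch under that reading; the names and the LSB-first bit lists are assumptions, and software evaluates the copies one after another where hardware would run them concurrently with A1.

```python
def ripple_add(x_bits, y_bits, cin):
    """Plain ripple-carry add of LSB-first bit lists: (sum bits, carry-out)."""
    out, c = [], cin
    for x, y in zip(x_bits, y_bits):
        out.append(x ^ y ^ c)
        c = (x & y) | (c & (x ^ y))
    return out, c


def carry_select_add(x_bits, y_bits, cin=0):
    """Speculative add: both possible MS-half results are prepared, then one
    is selected by the carry-out of the LS half."""
    mid = len(x_bits) // 2
    # A1: the LS half, which produces the 1-bit i/p that A2 depends on.
    ls_sum, ls_carry = ripple_add(x_bits[:mid], y_bits[:mid], cin)
    # 2^1 copies of A2, each with a hardwired carry-in (0 and 1).
    copies = [ripple_add(x_bits[mid:], y_bits[mid:], c) for c in (0, 1)]
    # 2:1 mux selected by the carry from A1.
    ms_sum, ms_carry = copies[ls_carry]
    return ls_sum + ms_sum, ms_carry


def to_bits_lsb(value, width):
    """Unsigned integer to a list of bits, LSB first."""
    return [(value >> i) & 1 for i in range(width)]


def from_bits_lsb(bits):
    """List of bits, LSB first, back to an unsigned integer."""
    return sum(b << i for i, b in enumerate(bits))


total_bits, total_carry = carry_select_add(to_bits_lsb(9, 4), to_bits_lsb(7, 4))
```

The price of the shorter critical path is the duplicated A2 hardware plus the mux, which is the trade-off the strategy's name describes.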


    Example of the Speculative Strategy in Adder Design

    For a 16-bit adder, the delay is (9*4 - 8)*2 = 56 ns (2 ns is the delay for a single i/p); a 65% improvement ((160-56)*100/160)

    For a 64-bit adder, the delay is (9*8 - 8)*2 = 128 ns; an 80% improvement.
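The quoted figures can be checked with a few lines of arithmetic. The sketch below assumes the delay expressions read (9*4 - 8)*2 and (9*8 - 8)*2, which is the reading that reproduces the stated 56 ns and 128 ns, and compares them with the 160 ns and 640 ns quoted for the Wait strategy. `improvement_percent` is a hypothetical helper name.

```python
def improvement_percent(old_ns, new_ns):
    """Relative delay reduction, in percent."""
    return (old_ns - new_ns) * 100 / old_ns


NS_PER_INPUT = 2                               # 2 ns per gate i/p, as stated

delay_16_spec = (9 * 4 - 8) * NS_PER_INPUT     # speculative, 16-bit
delay_64_spec = (9 * 8 - 8) * NS_PER_INPUT     # speculative, 64-bit

gain_16 = improvement_percent(160, delay_16_spec)
gain_64 = improvement_percent(640, delay_64_spec)
```

Under this reading both percentages on the slide follow directly: 65% for 16 bits and 80% for 64 bits.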



    Dependency Resolution in D&C: (3) The Lookahead or Pre-Computation Strategy

    Strategy 3: Redo the design of A2 so that it can do as much processing as possible that is independent of the i/p from A1 (A2_indep = A2_lookahd). This is the lookahead computation that prepares for the final computation of A2 (A2_dep) that can start once A2_indep and A1 are done.

    t(A) = max(t(A1), t(A2_indep)) + t(A2_dep) + t(stitch-up)

    E.g., Carry-lookahead adder: does lookahead computation; also, lookahead computation is associative, so doable in (log n) time. Overall computation is also doable in (log n) time.

    [Figure (Concept): Root problem A. Subprob. A2 is split into A2_indep (or A2_lookahd) and A2_dep; data flows from Subprob. A1 and from A2_indep into A2_dep.]

    A less structured example: Let a1 be the i/p from A1 to A2. If A2 has the logic:

    a2 = vx + uvx + wxy + wza1 + uxa1

    If this were implemented using 2-i/p AND/OR gates, the delay will be 8 delay units (1 unit = delay for 1 i/p) after a1 is available. If the logic is re-structured as

    a2 = (vx + uvx + wxy) + (wz + ux)a1

    and if the logic in the 2 brackets is performed before a1 is available (these constitute A2_indep), then the delay is only 4 delay units after a1 is available.

    [Figure (Example of an unstructured logic for A2): two gate-level circuits for a2. In the original A2, with product terms w x y, w z a1, u x a1, v x, u v x, the critical path after a1 is avail has an 8-unit delay. In the re-structured circuit, split into A2_indep and A2_dep, the critical path after a1 is avail has a 4-unit delay.]
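That the restructured expression computes the same function can be checked exhaustively. The Python sketch below takes both expressions literally as transcribed (any complement bars that may have been lost in extraction are not modelled) and compares them over all 2^7 input combinations; the function names are hypothetical.

```python
from itertools import product


def a2_original(u, v, w, x, y, z, a1):
    """a2 = vx + uvx + wxy + wza1 + uxa1"""
    return (v & x) | (u & v & x) | (w & x & y) | (w & z & a1) | (u & x & a1)


def a2_restructured(u, v, w, x, y, z, a1):
    """a2 = (vx + uvx + wxy) + (wz + ux)a1"""
    # A2_indep: both brackets can be evaluated before a1 arrives.
    indep_sum = (v & x) | (u & v & x) | (w & x & y)
    indep_factor = (w & z) | (u & x)
    # A2_dep: one AND and one OR once a1 is available.
    return indep_sum | (indep_factor & a1)


def equivalent():
    """True if the two forms agree on every input combination."""
    return all(a2_original(*bits) == a2_restructured(*bits)
               for bits in product((0, 1), repeat=7))


same_function = equivalent()
```

After a1 arrives, only one 2-i/p AND and one 2-i/p OR remain on the path, which matches the 4-unit figure when each gate i/p counts as one unit.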


    D&C Summary

    For complex digital design, we need to think of the computation underlying the design in an algorithmic manner: are there properties of this computation that can be exploited for faster, less expensive, modular design; is it amenable to the D&C approach?

    The design is then developed in an algorithmic manner & the corresponding circuit may be synthesized by hand or described compactly using a HDL.

    For an operation/func x on n operands (a_{n-1} x a_{n-2} x ... x a_0), if x is associative, the D&C approach gives an easy stitch-up function, which is x on 2 operands (o/ps of applying x on each half). This results in a tree-structured circuit with a delay of order log n, instead of a linearly-connected circuit with a delay of order n.

    If x is non-associative, more ingenuity and determination of properties of x are needed to determine the stitch-up function. The resulting design may or may not be tree-structured.

    D&C can be done top-down or bottom-up. Top-down is generally the better way to think for beginners.

    If there is dependency between the 2 subproblems, then we saw strategies for addressing these dependencies:
      Wait (slowest, least hardware cost)
      Speculative (fastest, highest hardware cost)
      Lookahead (medium speed, medium hardware cost)


    Strategy 2: A general view of speculative computations (w/ or w/o D&C)

    If there is a data dependency between two or more portions of a computation (which may be obtained w/ or w/o using D&C), don't wait for the previous computation to finish before starting the next one.

    Assume all possible input values for the next computation/stage B (e.g., if it has 2 inputs from the prev. stage there will be 4 possible input value combinations) and perform it using a copy of the design for each possible input value.

    All the different o/ps of the diff. copies of B are Muxed using prev. stage A's o/p.

    E.g. design: Carry-Select Adder (at each stage performs two additions, one for carry-in of 0 and another for carry-in of 1 from the previous stage).

    [Figure (a) Original design: stages A and B in series, with i/ps x and y and o/p z. Time = T(A) + T(B).]

    [Figure (b) Speculative computation: four copies of B, namely B(0,0), B(0,1), B(1,0) and B(1,1), each with its two i/ps from A hardwired, feed a 4:1 Mux that is selected by the o/p of A and produces z. Time = max(T(A), T(B)) + T(Mux). Works well when T(A) approx = T(B) and T(A) >> T(Mux).]
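The two timing expressions, and the hardware price of speculation, can be compared numerically. This Python sketch uses made-up delay values purely for illustration; the function names are hypothetical.

```python
def sequential_time(t_a, t_b):
    """Original design: B starts only after A finishes."""
    return t_a + t_b


def speculative_time(t_a, t_b, t_mux):
    """Speculative design: A and all copies of B run concurrently."""
    return max(t_a, t_b) + t_mux


def copies_needed(k_bits):
    """Copies of B required when B takes k bits from A."""
    return 2 ** k_bits


# Hypothetical delays (arbitrary units): T(A) = T(B) = 10, T(Mux) = 1.
t_original = sequential_time(10, 10)
t_speculative = speculative_time(10, 10, 1)
b_copies = copies_needed(2)
```

With these placeholder values the delay drops from 20 to 11 units at the cost of four copies of B. If T(Mux) were comparable to T(A), or one stage were much longer than the other, the gain would shrink accordingly.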


    Strategy 3: Get the Best of Both Worlds (Average and Worst Case Delays)!

    Use 2 circuits with different worst-case and average-case behaviors.

    Use the first available output.

    Get the best of both (ave-case, worst-case) worlds.

    [Figure: the inputs are loaded into registers feeding two circuits: a Unary Division Ckt (good ave case, bad worst case) and a Non-Restoring Div. Ckt (bad ave case, good worst case). An Ext. FSM receives start and the done1 and done2 signals and drives the select of a Mux that picks between the two circuit outputs; the Mux output goes to a register.]

    In the above schematic, we get the good ave case performance of unary division (assuming uniformly distributed inputs) w/o the disadvantage of its bad worst-case performance.


    Strategy 4: Pipeline It!

    [Figure: the original ckt or datapath is converted to a simple level-partitioned pipeline with Stage 1, Stage 2, ..., Stage k (level partition may not always be possible but other pipelineable partitions may be).]

    Throughput is defined as # of outputs / sec.

    Non-pipelined throughput = (1 / D), where D = delay of original ckt's datapath.

    Pipeline throughput = 1 / (max stage delay + register delay).

    Special case: If original ckt's datapath is divided into n stages, each of equal delay, and dr is the delay of a register, then pipeline throughput = 1/((D/n) + dr). If dr is negligible compared to D/n, then pipeline throughput = n/D, n times that of the original ckt.
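The throughput formulas translate directly into a small calculator. The Python sketch below uses hypothetical delay values (D = 8 time units split into 4 equal stages) only to illustrate the special case; the function names are not from the lecture.

```python
def non_pipelined_throughput(total_delay):
    """Outputs per unit time for the original datapath: 1 / D."""
    return 1.0 / total_delay


def pipelined_throughput(stage_delays, register_delay):
    """Outputs per unit time: 1 / (max stage delay + register delay)."""
    return 1.0 / (max(stage_delays) + register_delay)


D = 8.0                       # hypothetical delay of the original datapath
stages = [2.0, 2.0, 2.0, 2.0]  # n = 4 equal stages, D/n = 2 each

base = non_pipelined_throughput(D)
ideal = pipelined_throughput(stages, register_delay=0.0)
with_registers = pipelined_throughput(stages, register_delay=0.5)
```

With a negligible register delay the pipeline is n = 4 times faster than the original; with a register delay of 0.5 units the speed-up falls to 3.2 times, and unequal stages would lower it further since the slowest stage sets the clock.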