pipelining parallel processing

Upload: galaxystar

Post on 14-Jan-2016

25 views

Category:

Documents


1 download

DESCRIPTION

Pipelining Parallel Processing

TRANSCRIPT

  • Digital VLSI ADigital VLSI ADigitalVLSIADigitalVLSIA

    Pipelining&ParaPipelining&Para

    MahdiS

    fSharifUniversity

    M.Shabany,Digital

    Architectures:Architectures:Architectures:Architectures:

    allelProcessingallelProcessing

    Shabany

    fyofTechnology

    lVLSIArchitectures

  • ArchitecturalTechniques: Criticalpathinanydesignisthelonge

    1. Anytwointernallatches/flipflops

    2. An input pad and an internal latch2. Aninputpadandaninternallatch

    3. Aninternallatchandanoutputpad

    4. Aninputpadandanoutputpad

    UseFFsrightafter/before

    input/outpadstoavoid

    th l t th

    InputPad

    thelastthreecases

    (offchipandpackagingdelay)

    The maximum delay between anyThemaximumdelaybetweenanytwosequentialelementsinadesignwilldeterminethemax

    clockspeed

    M.Shabany,Digital

    :CriticalPathestpathbetween

    d

    Comb.Logic

    2

    31OutputPad

    4

    lVLSIArchitectures

  • DigitalDesignMetrics

    Threeprimaryphysicalcharacter Speed Speed

    Throughput Latency Timing Timing

    Area Power

    M.Shabany,Digital

    isticsofadigitaldesign:

    lVLSIArchitectures

  • DigitalDesignMetrics

    Speed Throughput :oug put

    Theamountofdatathatispro

    Latency Latency Thetimebetweendatainput

    Timing Timing ThelogicdelaysbetweensequWhenadesigndoesnotmeetcritical path is greater than the tcriticalpathisgreaterthanthet

    M.Shabany,Digital

    ocessedperclockcycle(bitspersecond)

    andprocesseddataoutput(clockcycle)

    uentialelements(clockperiod)tthetimingitmeansthedelayofthetarget clock periodtargetclockperiod

    lVLSIArchitectures

  • MaximumClockFrequenc

    MaximumClockFrequency:

    logicqclkmax TTTF

    Tclkq :timefromclockarrivaluntildataa Tlogic : propagation delay through logic be Tlogic :propagationdelaythroughlogicbe Trouting :routingdelaybetweenflipflops Tsetup :minimumtimedatamustarriveat Tskew :propagationdelayofclockbetwee

    M.Shabany,Digital

    cy:CriticalPath

    1

    skewroutingsetup TTT1

    arrivesatQ

    etween flipflopsetweenflip flops

    tDbeforethenextrisingedgeofclock

    enthelaunchflipflopandthecaptureflipflop.

    lVLSIArchitectures

  • Pipelining(toImproveThr

    Pipelining: Comesfromtheideaofawate

    waiting the water in the pipe towaitingthewaterinthepipeto Usedtoreducethecriticalpath

    Advantageous: Reductioninthecriticalpath Higher throughput (number of Higherthroughput(numberof Increasestheclockspeed(orsa Reducesthepowerconsumptio

    M.Shabany,Digital

    roughput)

    rpipe:continuesendingwaterwithouto be outobeouthofthedesign

    computed results in a give time)computedresultsinagivetime)amplingspeed)onatsamespeed

    lVLSIArchitectures

  • ArchitecturalTechniques:

    Pipelining: Verysimilartotheassemblylinei Thebeautyofapipelineddesignibeforethepriordatahasfinished,massemblyline.

    M.Shabany,Digital

    :Pipelining

    ntheautoindustry

    sthatnewdatacanbeginprocessingmuchlikecarsareprocessedonan

    lVLSIArchitectures

  • ArchitecturalTechniques: OriginalSystem:(Criticalpath=1

    ComX

    Com

    Clk

    Criticalpath=1 Maxo

    Pipelinedversion:(Criticalpath= SmallerCriticalPathhighert l Longerlatency

    Comb.LogicX

    Clk

    Criticalpath=2

    M.Shabany,Digital

    Pipe

    :PipeliningMaxoperatingfreq:f1=1/1)

    2cyclesmb Logic

    f

    latermb.Logic

    peratingfreq:f1=1/1

    2 Maxoperatingfreq:f2=1/2)throughput(2f1)

    3cycleslater

    Comb.Logicf

    later

    Maxoperatingfreq:f2=1/2

    lVLSIArchitectures

    liningRegister

  • ArchitecturalTechniques:

    Pipelinedepth:0(NoPipeline) Criticalpath:3Adders

    Latency : 0 Latency:0

    ( ) (

    t1 t2

    X(1)

    Y(1)

    X(2

    Y(2

    M.Shabany,Digital

    :Pipelinedepth

    wire w1,w2;assign w1=X+a;assign w2=w1+b;g ;assign Y=w2+c;

    )

    timet3

    X(3)2)

    2)

    X(3)

    Y(3)

    lVLSIArchitectures

  • ArchitecturalTechniques:

    Pipelinedepth:1(OnePipelinereg Criticalpath:2Adders

    a(n) b(n) c

    2

    Latency : 1

    X(n)w1 w2

    Latency:1

    ( ) (2)

    t1 t2

    X(1)

    Y(1)

    X(2)

    M.Shabany,Digital

    :Pipelinedepth

    gisterAdded)

    wire w1;reg w2;assign w1=X+a;assign Y=w2+c;

    c(n)

    g ;

    always@(posedgeClk)w2

  • ArchitecturalTechniques:

    Pipelinedepth:2(OnePipelinereg Criticalpath:1Adder

    Latency : 2 Latency:2t1 t2 t3

    X(1) X(2)

    M.Shabany,Digital

    :Pipelinedepth

    gisterAdded)reg w1,w2;assign Y = w2 + c;assign Y=w2+c;

    always@(posedgeClk)beginw1

  • ArchitecturalTechniques:

    Clockperiodandthroughputasaf

    1 Clockperiod:

    h h Tn1

    Clk Throughput: nT

    Addi i t l iAddingregisterlayersimprovestimingbydividingthecriticalpathintotwopathsofsmallerdelay

    M.Shabany,Digital

    :Pipelining

    functionofpipelinedepth:

    Clock Period

    Throughput

    3 4 5 6Pipeline Depth

    lVLSIArchitectures

    Pipeline Depth

  • ArchitecturalTechniques:

    GeneralRule: Pipelininglatchescanonlybeofthecircuit.

    Cutset: Cutset:Asetofpathsofacircuitsuchtcircuitbecomesdisjoint(i.e.,twoj

    FeedForwardCutset: Acutsetiscalledfeedforwardforwarddirectiononallthepaths

    M.Shabany,Digital

    :Pipelining

    placedacrossfeedforwardcutsets

    thatifthesepathsareremoved,theseparatepieces)p p

    cutsetifthedatamoveinthesofthecutset

    lVLSIArchitectures

  • ArchitecturalTechniques:

    Example: FIRFilter Threefeedforwardcutsetsare

    M.Shabany,Digital

    :Pipelining

    eshownNOTafeedforwardcutset

    X(n) X(n1) X(n2)

    a b c

    Y(n)

    lVLSIArchitectures

  • ArchitecturalTechniques:CriticalPath:1M+2A

    X(n) X(n1) X(n2)

    a b c

    assign w1 = a*Xn;

    Y(n)w1 w2

    w3

    w4

    assign w1=a Xn;assign w2=b*Xn_1;assign w3=w1+w2;assign w4=c*Xn_2;assign Y=w3+w4;g ;always@(posedgeClk)beginXn_1

  • ArchitecturalTechniques:

    X(n) X(n1)

    a b

    2

    Cloc Input 1 2 3

    1 3

    Clock

    Input 1 2 3

    0 X(0) aX(0) aX(0)

    1 X(1) X(1) bX(0) X(1)+bX(0)1 X(1) aX(1) bX(0) aX(1)+bX(0)

    2 X(2) aX(2) bX(1) aX(2)+bX(1)

    3 X(3) aX(3) bX(2) aX(3)+bX(2)

    M.Shabany,Digital

    :Pipelining

    X(n2)

    c

    4

    4 5 Output

    Y(n)3

    45

    4 5 Output

    aX(0) Y(0)

    X(1)+bX(0) Y(1) aX(1)+bX(0) Y(1)

    cX(0) aX(2)+bX(1)+cX(0) Y(2)

    cX(1) aX(3)+bX(2)+cX(1) Y(3)

    lVLSIArchitectures

  • ArchitecturalTechniques:X(n) X(n1)

    a b

    1 2

    Cloc Input 1 2 3

    1

    Clock

    Input 1 2 3

    0 X(0)

    1 X(1) aX(0) aX(0)1 X(1) aX(0) aX(0)

    2 X(2) aX(1) bX(0) aX(1)+bX(0)

    3 X(3) aX(2) bX(1) aX(2)+bX(1)

    M.Shabany,Digital

    :PipeliningX(n2)

    c

    2 3 4 5

    4 5 Output

    Y(n)3 5

    4 5 Output

    aX(0) Y(0) aX(0) Y(0)

    ) aX(1)+bX(0) Y(1)

    ) cX(0) aX(2)+bX(1)+cX(0) Y(2)

    lVLSIArchitectures

  • ArchitecturalTechniques:

    Evenmorepipelining

    Clock Input 1 2 3

    0 X(0)

    1 X(1) aX(0) -2 X(2) aX(1) bX(0) aX(0)

    3 X(3) aX(2) bX(1) aX(1)+bX(0)

    4 X(3) aX(2) bX(1) aX(2)+bX(1)

    M.Shabany,Digital

    :Pipelining

    4 5 Output

    - - aX(0) Y(0)

    aX(1)+bX(0) Y(1)

    cX(0) aX(2)+bX(1)+cX(0) Y(2)

    lVLSIArchitectures

  • ArchitecturalTechniques:

    Pipeliningattheoperationlevel Breakthemultiplierintotwop

    FinePiPipe

    M.Shabany,Digital

    :FineGrainPipelining

    arts

    X(n) X(n1) X(n2)

    a b cm1 m1 m1

    Grainli i

    m2 m2 m2

    lining

    Y(n)

    lVLSIArchitectures

  • UnrollingtheLoopUsingP

    CalculationofX3Throughput=8/3,or2.7bits/clockLatency = 3 clocksLatency 3clocksTiming=Onemultiplierinthecritical

    Iterativeimplementation:p Nonewcomputationscanbeginun

    previouscomputationhascomplet

    M.Shabany,Digital

    Pipelining

    module power3(outputreg[7:0]X3,output finished,

    path input [7:0]X,input clk,start);reg [7:0]ncount;reg [7:0]Xpower,Xin;

    ntiltheted

    assign finished=(ncount ==0);always@(posedge clk)if (start)beginXPower

  • UnrollingtheLoopUsingP

    CalculationofX3Throughput=8/1,or8bits/clock(3XLatency = 3 clocksLatency 3clocksTiming=Onemultiplierinthecritical

    Penalty:MoreAreaUnrollinganalgorithmwithniterativeloo

    throughputbyafactorofn

    Clk

    X2

    X[0:7]

    xpower1 xpower2

    M.Shabany,Digital

    Clk

    Pipelining

    improvement)module power3(output reg[7:0]XPower,input clk,

    pathinput clk,input [7:0]X);reg [7:0]XPower1,XPower2;reg [7:0]X2;always @(posedge clk) beginalways @(posedgeclk)begin//Pipelinestage1XPower1

  • RemovingPipelineRegiste

    CalculationofX3Throughput=8bits/clock(3XimprovLatency = 0 clocksLatency 0clocksTiming=Twomultipliersinthecritica

    Latencycanbereducedbyremovingpipel

    M.Shabany,Digital

    ers(toImproveLatency)

    vement)module power3(O t t [7 0] XP

    lpathOutput [7:0]XPower,input [7:0]X);reg [7:0]XPower1,XPower2;reg [7:0]X1,X2;l @*always @*XPower1=X;

    always @(*)beginX2 XP 1

    lineregisters

    X2=XPower1;XPower2=XPower1*XPower1;

    end

    i XP XP 2 * X2assign XPower =XPower2*X2;

    endmodule

    lVLSIArchitectures

  • ArchitecturalTechniques: I ll l i h h Inparallelprocessingthesamehar

    Increasesthethroughputwithoutcha Increasesthesiliconarea

    X(n)

    a(n)

    Pipelining

    ClockFreq:2f

    Throughput:2Msamples

    M.Shabany,Digital

    :ParallelProcessingd i d li drdwareisduplicatedtoangingthecriticalpath

    b(n)

    Y(n)

    ClockFreq:f

    Throughput:Msamples

    a(2k) b(2k)

    ParallelProcessing

    X(2k) Y(2k)

    a(2k+1) b(2k+1)

    X(2k+1) Y(2k+1)

    ClockFreq:f

    lVLSIArchitectures

    q

    Throughput:2Msamples

  • ArchitecturalTechniques:

    Parallelprocessingfora3tapFIRf Bothhavethesamecriticalpath(M

    X

    ParallelFactor:3

    1)cx(3kbx(3k)1)ax(3k1)y(3k

    2)cx(3k1)bx(3kax(3k)y(3k)

    cx(3k)1)bx(3k2)ax(3k2)y(3k

    1)cx(3kbx(3k)1)ax(3k1)y(3k

    Not a simple dup

    M.Shabany,Digital

    Notasimpledup

    :ParallelProcessing

    filterM+2A)

    X(3k)X(3k+1)X(3k+2)

    a b c

    Y(3k+2)

    X(3k2) X(3k1)

    c a

    Y(3k+1)

    b

    b c a

    plication!

    lVLSIArchitectures

    Y(3k)

    plication!

  • SamplePeriodvs.ClockPe

    Pipelinedsystem:Tclk =Tsample

    MClksample TTT

    ParallelSystem:Tclk Tsamplep

    112T(T

    31

    T31

    T AMClksample

    Higher Sample rate than the

    M.Shabany,Digital

    HigherSampleratethanthe

    eriod

    X(3k)

    a b c

    X(3k+1)X(3k+2)

    Y(3k+2)

    c a b

    X(3k2) X(3k1)

    Y(3k+1)

    b c a

    )A

    clock rate

    lVLSIArchitectures

    Y(3k)

    clockrate

  • CompleteParallelSystem

    ACompleteParallelSystem:

    M.Shabany,Digital

    withS/PandP/S

    lVLSIArchitectures

  • S/PandP/SBlocks

    S/PConverter:

    M.Shabany,Digital

    P/SConverter:

    Y(3k)Y(3k+1)Y(3k+2)

    T T T

    Y(n)

    T

    T/3T/3

    SamplePeriT/3

    T T

    ParalleltoSerialConverter

    lVLSIArchitectures

  • WhenPipeliningWhenPa

    Pipelinetechniqueisusedwhenth(Number1,2,3,4)

    ParallelismisusedwhenthecriticacommunicationorI/Obound.(Numb

    Pipeliningdoesnothelpinthiscas a.k.a CommunicationBounded

    3

    Comb.Logic

    1

    2

    531

    4

    M.Shabany,Digital

    arallelism?

    hecriticalpathisinthedesign

    alpathisboundedbytheber5)se!

    Comb.Logic

    lVLSIArchitectures