
  • Lecture 3 Slide 1 EECS 470

    EECS 470

    Lecture 3

    Pipelining & Hazards I

    Jon Beaumont


    Slides developed in part by Profs. Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Mudge, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, and Wenisch of Carnegie Mellon University, Purdue University, University of Michigan, University of Pennsylvania, and University of Wisconsin.

  • Lecture 3 Slide 2 EECS 470

    Announcements

    • Reminder
      Lab #1 due Friday by end of lab (12:30 pm); get checked off during GSI/IA OH
      Verilog assignment #1 due Friday evening; submit to the autograder by 11:59 pm (25 submissions so far, 30% of the class)
      HW #1 due Thursday 2/4; submit through Gradescope by 11:59 pm

    • Adding more staff OH
      Also experimenting with different formats to see what works best
      Check the description on the Google calendar for each session for details
      We'll try to come up with a consistent format to avoid confusion

  • Lecture 3 Slide 3 EECS 470

    Project 1

  • Lecture 3 Slide 4 EECS 470

    Project 1

  • Lecture 3 Slide 5 EECS 470

    Last Time

    • Quantifying performance
      Latency vs. throughput
      Different averaging techniques (arithmetic, harmonic, geometric)

    • Power and energy

  • Lecture 3 Slide 6 EECS 470

    Today

    • Baseline processor discussion
      Review the 5-stage pipeline from EECS 370
      Introduce hazards

  • Lecture 3 Slide 7 EECS 470

    Lingering Questions

    • "Could you give a few more examples of computing applications where we care more about throughput??"

    • Remember, you can submit lingering questions to cover next lecture at: https://bit.ly/3oSr5FD

    Latency: real-time systems (self-driving cars, drones, etc.), web search, processing audio/video

    Throughput: scientific computing (e.g., simulations), autograding a class's projects, training machine learning models

  • Lecture 3 Slide 8 EECS 470

    Capacitive Power dissipation

    Power ≈ ½ · C · V² · A · f

    Capacitance: Function of wire length, transistor size

    Supply Voltage: Has been dropping with successive fab generations

    Clock frequency: Increasing…

    Activity factor: How often, on average, do wires switch?

    (Sidebar banner: What uses power in a chip?)
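
    To make the formula concrete, here is a minimal Python sketch (not from the slides; the parameter values are made up purely for illustration):

        # Dynamic (switching) power: P ~ 1/2 * C * V^2 * A * f
        def dynamic_power(c_farads, v_volts, activity, f_hertz):
            """Average switching power of a CMOS chip."""
            return 0.5 * c_farads * v_volts**2 * activity * f_hertz

        # Hypothetical values: 1 nF of switched capacitance, 1.5 V supply,
        # 10% activity factor, 1 GHz clock.
        print(dynamic_power(1e-9, 1.5, 0.10, 1e9))   # ~0.11 W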

  • Lecture 3 Slide 9 EECS 470

    Voltage Scaling

    • Scenario: 80 W, 1 BIPS, 1.5 V, 1 GHz
      Cache optimization: IPC decreases by 10%, power drops by 20%
      => Final processor: 900 MIPS, 64 W

    • What if we just adjust frequency/voltage on the processor? How do we reduce power by 20%?
      P = CV²f ∝ V³ (since f scales with V) => drop voltage by 7% (and also frequency): 0.93 × 0.93 × 0.93 ≈ 0.8x

      So for equal power (64 W):
      Cache optimization = 900 MIPS
      Simple voltage/frequency scaling = 930 MIPS
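
    A minimal sketch of the arithmetic on this slide (assuming, as the slide does, that frequency scales with voltage, so power scales as V³):

        base_power, base_mips = 80.0, 1000.0      # 80 W, 1 BIPS baseline

        # Option 1: cache optimization -- IPC drops 10%, power drops 20%
        cache_mips, cache_power = base_mips * 0.90, base_power * 0.80

        # Option 2: voltage/frequency scaling -- drop V (and f) by 7%,
        # so power scales by 0.93**3 ~ 0.8 while performance scales by 0.93
        scale = 0.93
        dvfs_mips, dvfs_power = base_mips * scale, base_power * scale**3

        print(f"cache opt:   {cache_mips:.0f} MIPS at {cache_power:.0f} W")
        print(f"V/f scaling: {dvfs_mips:.0f} MIPS at {dvfs_power:.0f} W")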

  • Lecture 3 Slide 10 EECS 470

    Multicore: Solution to Power-constrained design?

    • Power scales roughly cubically with frequency
    • Scale the clock frequency to 80%, then add a second core
      Same power budget, but 1.6x performance!
    • But: must parallelize the application (remember Amdahl's Law!)

    [Chart comparing performance and power for the single-core and dual-core configurations.]
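
    The numbers behind the 1.6x claim, as a small sketch (same cubic power assumption as above; ignores Amdahl's Law):

        freq_scale = 0.8
        power_per_core = freq_scale ** 3       # ~0.51x of the original core
        total_power    = 2 * power_per_core    # ~1.02x: roughly the same budget
        total_perf     = 2 * freq_scale        # 1.6x, if the work parallelizes
        print(total_power, total_perf)         # ~1.02 1.6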

  • Lecture 3 Slide 11 EECS 470

    The Execution Core: Pipelining

  • Lecture 3 Slide 12 EECS 470

    Outline for the next several lectures: Understanding the Execution Core

    High-level design feature -> actual microarchitecture example:
    1. Pipelining: 370's 5-stage pipeline (review)
    2. Dynamic scheduling: Scoreboard (CDC 6600)
    3. Register renaming: Tomasulo's algorithm (IBM 360)
    4. Precise interrupts with a reorder buffer: P6, MIPS R10K

  • Lecture 3 Slide 13 EECS 470

    Before there was pipelining…

    Basic datapath: fetch, decode, execute

    • Single-cycle control:
      + Low CPI (1)
      – Long clock period (to accommodate the slowest instruction)

    • Multi-cycle control:
      + Short clock period
      – High CPI
      + Potentially better overall latency if designed well (could it ever be worse?)

    Single-cycle:  [insn0.fetch, dec, exec] [insn1.fetch, dec, exec]
    Multi-cycle:   [insn0.fetch] [insn0.dec] [insn0.exec] [insn1.fetch] [insn1.dec] [insn1.exec]

  • Lecture 3 Slide 14 EECS 470

    Speeding Up

    Remember, three ways to speed up a process:
    • Reduce the number of tasks (possible?)
    • Decrease the latency of each task (what would that include?)
    • Parallelize

    How do we parallelize this pipeline?

    (Same single-cycle vs. multi-cycle timeline as the previous slide.)

  • Lecture 3 Slide 15 EECS 470

    Parallelize

    Duplicate the pipeline (superscalar):
    • Effective, but expensive (~2x hardware overhead)
    • Discussed more later in the semester

    • Or… pipeline!

    [Timelines: the baseline multi-cycle flow of insn0 and insn1, and a duplicated (superscalar) datapath in which one copy processes insn0 and insn2 while the other processes insn1 and insn3.]

  • Lecture 3 Slide 16 EECS 470

    Pipelining

    • Important performance technique
      Improves throughput at the expense of latency
      Why does latency go up?

    • Begin with the multi-cycle design
      When an instruction advances from stage 1 to stage 2, allow the next instruction to enter stage 1
      Each instruction still passes through all stages
      + But instructions enter and leave at a much faster rate
      + More instructions execute in parallel

    • Not much hardware overhead (what needs to be added?)

    Multi-cycle:  insn0.fetch  insn0.dec  insn0.exec  insn1.fetch  insn1.dec  insn1.exec
    Pipelined:    insn0.fetch  insn0.dec    insn0.exec
                               insn1.fetch  insn1.dec   insn1.exec

  • Lecture 3 Slide 17 EECS 470

    Pipeline Illustrated:

    [Figure: a block of combinational logic with n gate delays has bandwidth ≈ 1/n. Inserting a latch (L) to split it into two stages of ~n/2 gate delays each raises bandwidth to ≈ 2/n; splitting it into three stages of ~n/3 raises it to ≈ 3/n, ignoring latch overhead.]
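
    A quick sketch of the bandwidth math in the figure; the latch delay is a made-up value, just to show that real pipelining falls a bit short of the ideal k/n:

        def bandwidth(n_gate_delays, k_stages, latch_delay=1):
            """Throughput of a combinational block split into k equal stages."""
            clock_period = n_gate_delays / k_stages + latch_delay
            return 1.0 / clock_period

        for k in (1, 2, 3):                       # ideally BW ~ 1/n, 2/n, 3/n
            print(k, round(bandwidth(30, k), 3))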

  • Lecture 3 Slide 18 EECS 470

    370 Processor Pipeline Review

    [Pipeline diagram: PC (with a +1 adder) -> I-cache (Fetch) -> register file (Decode) -> ALU (Execute) -> D-cache (Memory) -> register file write port (Write-back).]

    T_pipeline = T_base / 5

  • Lecture 3 Slide 19 EECS 470

    Stage 1: Fetch

    Fetch an instruction from memory every cycle:
    • Use the PC to index memory
    • Increment the PC (assume no branches for now)

    Write state to the pipeline register (IF/ID):
    • The next stage will read this pipeline register
    • Note that the pipeline register must be edge-triggered
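
    As a behavioral sketch (not the actual hardware), the fetch stage computes the values that get latched into IF/ID at the next clock edge; the instruction memory contents below are placeholders:

        def fetch(pc, imem):
            """Values latched into the IF/ID pipeline register at the clock edge."""
            return {"inst": imem[pc], "pc_plus_1": pc + 1}

        imem = ["add 1 2 3", "nand 4 5 6", "lw 2 4 20"]   # toy instruction memory
        if_id = fetch(0, imem)          # edge-triggered update of IF/ID
        print(if_id)                    # {'inst': 'add 1 2 3', 'pc_plus_1': 1}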

  • Lecture 3 Slide 20 EECS 470

    [Fetch datapath: the PC indexes the instruction memory/cache; an adder computes PC + 1, which feeds a MUX back into the PC; the instruction bits and PC + 1 are written (with enables) into the IF/ID pipeline register for the rest of the pipelined datapath.]

  • Lecture 3 Slide 21 EECS 470

    Stage 2: Decode

    Decode the opcode bits:
    • May set up control signals for later stages

    Read input operands from the register file:
    • Specified by the regA and regB fields of the instruction bits

    Write state to the pipeline register (ID/EX):
    • Opcode
    • Register contents
    • Offset and destination fields
    • PC + 1 (even though decode didn't use it)

  • Lecture 3 Slide 22 EECS 470

    [Decode datapath: the regA and regB fields of the instruction bits index the register file; the contents of regA, the contents of regB, PC + 1, and the decoded control signals are written into the ID/EX pipeline register. The register file's write port (dest reg, data, enable) is driven later by write-back. The Stage 1 fetch datapath sits to the left.]

  • Lecture 3 Slide 23 EECS 470

    Stage 3: Execute

    Perform the ALU operation:
    • Input operands can be the contents of regA or regB, or the offset field of the instruction
    • Branches: calculate PC + 1 + offset

    Write state to the pipeline register (EX/Mem):
    • ALU result, contents of regB, and PC + 1 + offset
    • Instruction bits for the opcode and destReg specifiers

  • Lecture 3 Slide 24 EECS 470

    [Execute datapath: the ALU operates on the contents of regA and, via a MUX, either the contents of regB or the offset field; an adder computes PC + 1 + offset. The ALU result, contents of regB, PC + 1 + offset, and control signals are written into the EX/Mem pipeline register. The Stage 2 decode datapath sits to the left.]

  • Lecture 3 Slide 25 EECS 470

    Stage 4: Memory Operation

    Perform the data cache access for memory operations:
    • The ALU result contains the address for ld and st
    • Opcode bits control the memory R/W and enable signals

    Write state to the pipeline register (Mem/WB):
    • ALU result and MemData
    • Instruction bits for the opcode and destReg specifiers

  • Lecture 3 Slide 26 EECS 470

    [Memory datapath: the ALU result addresses the data memory, whose enable and R/W signals come from the control signals; the memory read data and the ALU result are written into the Mem/WB pipeline register. PC + 1 + offset and the MUX control for the PC input go back to the MUX before the PC in Stage 1. The Stage 3 execute datapath sits to the left.]

  • Lecture 3 Slide 27 EECS 470

    Stage 5: Write back

    Write the result to the register file (if required):
    • Write MemData to destReg for a ld instruction
    • Write the ALU result to destReg for an arithmetic instruction
    • Opcode bits control the register write enable signal

  • Lecture 3 Slide 28 EECS 470

    [Write-back datapath: a MUX selects between the memory read data and the ALU result and drives the data input of the register file; another MUX (bits 0-2 vs. bits 16-18 of the instruction) selects the destination register specifier; control signals drive the register write enable. The Stage 4 memory datapath sits to the left.]

  • Lecture 3 Slide 29 EECS 470

    Sample Code (Simple)

    Run the following code on a pipelined datapath:

        add  1 2 3   ; reg3 = reg1 + reg2
        nand 4 5 6   ; reg6 = ~(reg4 & reg5)
        lw   2 4 20  ; reg4 = Mem[reg2 + 20]
        add  2 5 5   ; reg5 = reg2 + reg5
        sw   3 7 10  ; Mem[reg3 + 10] = reg7
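
    As a cross-check of the walk-through on the following slides, here is the same program executed architecturally (one instruction at a time) in Python, using the initial register and memory values shown on the Initial State slide:

        regs = {0: 0, 1: 36, 2: 9, 3: 12, 4: 18, 5: 7, 6: 41, 7: 22}
        mem  = {29: 99}                      # only the location the lw touches

        regs[3] = regs[1] + regs[2]          # add  1 2 3   -> R3 = 45
        regs[6] = ~(regs[4] & regs[5])       # nand 4 5 6   -> R6 = -3
        regs[4] = mem[regs[2] + 20]          # lw   2 4 20  -> R4 = 99
        regs[5] = regs[2] + regs[5]          # add  2 5 5   -> R5 = 16
        mem[regs[3] + 10] = regs[7]          # sw   3 7 10  -> Mem[55] = 22

        print(regs, mem)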

  • Lecture 3 Slide 30 EECS 470

    [Full pipelined datapath: PC, instruction memory, register file (R0-R7), ALU, data memory, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, whose fields are labeled op, dest, offset, valA, valB, PC+1, target, ALU result, and mdata.]

  • Lecture 3 Slide 31 EECS 470

    [Datapath snapshot, Initial State: all pipeline registers hold noops; the register file holds R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22.]

  • Lecture 3 Slide 32 EECS 470

    [Datapath snapshot, Time 1: add 1 2 3 is being fetched; the rest of the pipeline still holds noops.]

  • Lecture 3 Slide 33 EECS 470

    [Datapath snapshot, Time 2: nand 4 5 6 is being fetched; add 1 2 3 is in Decode, reading R1=36 and R2=9.]

  • Lecture 3 Slide 34 EECS 470

    [Datapath snapshot, Time 3: lw 2 4 20 is being fetched; nand 4 5 6 is in Decode, reading R4=18 and R5=7; add 1 2 3 is in Execute, computing 36 + 9 = 45.]

  • Lecture 3 Slide 35 EECS 470

    [Datapath snapshot, Time 4: add 2 5 5 is being fetched; lw 2 4 20 is in Decode; nand 4 5 6 is in Execute, computing ~(18 & 7) = -3; add 1 2 3 (result 45) is in Memory.]

  • Lecture 3 Slide 36 EECS 470

    [Datapath snapshot, Time 5: sw 3 7 10 is being fetched; add 2 5 5 is in Decode; lw 2 4 20 is in Execute, computing address 9 + 20 = 29; nand 4 5 6 (result -3) is in Memory; add 1 2 3 writes 45 into R3.]

  • Lecture 3 Slide 37 EECS 470

    [Datapath snapshot, Time 6: no more instructions to fetch; sw 3 7 10 is in Decode, reading R3=45 and R7=22; add 2 5 5 is in Execute, computing 9 + 7 = 16; lw 2 4 20 is in Memory, reading Mem[29] = 99; nand 4 5 6 writes -3 into R6.]

  • Lecture 3 Slide 38 EECS 470

    [Datapath snapshot, Time 7: sw 3 7 10 is in Execute, computing address 45 + 10 = 55; add 2 5 5 (result 16) is in Memory; lw 2 4 20 writes 99 into R4.]

  • Lecture 3 Slide 39 EECS 470

    [Datapath snapshot, Time 8: sw 3 7 10 is in Memory, writing 22 into Mem[55]; add 2 5 5 writes 16 into R5.]

  • Lecture 3 Slide 40 EECS 470

    [Datapath snapshot, Time 9: sw 3 7 10 is in Write-back (no register write); the pipeline drains.]

  • Lecture 3 Slide 41 EECS 470

    Time graphs

    Time:   1          2          3          4          5          6          7          8          9
    add     fetch      decode     execute    memory     writeback
    nand               fetch      decode     execute    memory     writeback
    lw                            fetch      decode     execute    memory     writeback
    add                                      fetch      decode     execute    memory     writeback
    sw                                                  fetch      decode     execute    memory     writeback
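
    The table above can be regenerated with a few lines of Python (in-order 5-stage pipeline, one instruction entering fetch per cycle, no hazards):

        STAGES  = ["fetch", "decode", "execute", "memory", "writeback"]
        program = ["add 1 2 3", "nand 4 5 6", "lw 2 4 20", "add 2 5 5", "sw 3 7 10"]

        for i, inst in enumerate(program):
            # Instruction i enters fetch in cycle i+1, then moves one stage per cycle.
            row = [""] * i + STAGES
            print(f"{inst:<12}" + "".join(f"{s:<11}" for s in row))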

  • Lecture 3 Slide 42 EECS 470

    Balancing Pipeline Stages

    IF:  T_IF  = 6 units
    ID:  T_ID  = 2 units
    EX:  T_EX  = 9 units
    MEM: T_MEM = 5 units
    WB:  T_WB  = 8 units

    What is the speedup of the pipelined processor over the single-cycle processor (assuming no hazards)?

    a) 2/30

    b) 6/30

    c) 9/30

    d) 30/9

    e) 30/6

    f) 30/2

    g) No idea

    Can we do better in terms of either performance or efficiency?
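
    The arithmetic the question is asking for, assuming ideal pipelining (clock set by the slowest stage, no pipeline-register overhead, no hazards):

        stage_delays = {"IF": 6, "ID": 2, "EX": 9, "MEM": 5, "WB": 8}   # units

        single_cycle_period = sum(stage_delays.values())   # 30 units
        pipelined_period    = max(stage_delays.values())   # 9 units (EX)
        print(single_cycle_period / pipelined_period)      # speedup ~ 30/9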

  • Lecture 3 Slide 43 EECS 470

    Balancing Pipeline Stages

    Two methods for stage quantization:
    • Merge multiple stages
    • Further subdivide a stage

    Recent trends:
    • Deeper pipelines (more and more stages)
      Pipeline depth has been growing more slowly since the Pentium 4. Why?
    • Multiple pipelines
    • Pipelined memory/cache accesses (tricky)

  • Lecture 3 Slide 44 EECS 470

    The Cost of Deeper Pipelines

    Instruction pipelines are not ideal, i.e., instructions in different stages can have dependencies.

    Suppose:
        add  1 2 3
        nand 3 4 5

    [Timing diagrams: without a dependence, Inst0 and Inst1 flow F D E M W one cycle apart (t0-t5). With the dependence above, the nand reaches Decode before the add has written R3: a RAW (read-after-write) dependency, so the nand must stall in Decode before continuing to E, M, W.]

  • Lecture 3 Slide 45 EECS 470

    Types of Dependencies and Hazards

    Data dependence (both memory and register):
    • True dependence (RAW): an instruction must wait for all required input operands
    • Anti-dependence (WAR): a later write must not clobber a still-pending earlier read
    • Output dependence (WAW): an earlier write must not clobber an already-completed later write

    Control dependence (aka procedural dependence):
    • Conditional branches may change the instruction sequence
    • Instructions after a conditional branch depend on its outcome (more exact definition later)
    • Not an issue now, but stay tuned
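
    A tiny sketch of how the three register dependences can be classified, given each instruction's destination and source registers (a behavioral illustration, not hardware):

        def classify(first, second):
            """first/second are (dest, set_of_sources); returns second's hazards w.r.t. first."""
            dep = []
            if first[0] in second[1]:
                dep.append("RAW")    # true dependence: second reads what first writes
            if second[0] in first[1]:
                dep.append("WAR")    # anti-dependence: second overwrites a source of first
            if second[0] == first[0]:
                dep.append("WAW")    # output dependence: both write the same register
            return dep

        # add 1 2 3 followed by nand 3 4 5: the nand reads R3, which the add writes
        print(classify((3, {1, 2}), (5, {3, 4})))    # ['RAW']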

  • Lecture 3 Slide 46 EECS 470

    Terminology

    Pipeline hazards:
    • Potential violations of program dependences
    • Must ensure program dependences are not violated

    Hazard resolution:
    • Static method: performed at compile time, in software
    • Dynamic method: performed at run time, using hardware

    Pipeline interlock:
    • Hardware mechanisms for dynamic hazard resolution
    • Must detect and enforce dependences at run time

  • Lecture 3 Slide 48 EECS 470

    Handling Data Hazards

    Avoidance (static):
    • Make sure there are no hazards in the code

    Detect and stall (dynamic):
    • Stall until earlier instructions finish

    Detect and forward (dynamic):
    • Get the correct value from elsewhere in the pipeline

  • Lecture 3 Slide 49 EECS 470

    Handling Data Hazards: Avoidance

    The programmer/compiler must know implementation details and insert noops between dependent instructions:

        add  1 2 3    ; writes R3 in cycle 5
        noop
        noop
        nand 3 4 5    ; reads R3 in cycle 6

  • Lecture 3 Slide 50 EECS 470

    Problems with Avoidance

    Binary compatibility:
    • New implementations may require more noops

    Code size:
    • Higher instruction cache footprint
    • Longer binary load times
    • Worse in machines that execute multiple instructions per cycle (Intel Itanium: 25-40% of instructions are noops)

    Slower execution:
    • CPI = 1, but many instructions are noops

  • Lecture 3 Slide 51 EECS 470

    Handling Data Hazards: Detect & Stall

    Detection:
    • Compare regA and regB with the destReg of preceding instructions (3-bit comparators)

    Stall:
    • Do not advance the pipeline registers for Fetch/Decode
    • Pass a noop to Execute
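
    A behavioral sketch of the detection step (the 3-bit comparators become simple equality checks; this is an illustration, not the actual control logic):

        def need_stall(decode_srcs, inflight_dests):
            """True if the instruction in Decode reads a register that an older,
            not-yet-written-back instruction will write."""
            return any(src in inflight_dests for src in decode_srcs)

        # nand 3 4 5 sits in Decode while add 1 2 3 (dest R3) is still in flight:
        print(need_stall(decode_srcs={3, 4}, inflight_dests={3}))   # True -> stall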

  • Lecture 3 Slide 52 EECS 470

    Next Time

    • Continue the 5-stage review
      Discuss detect-and-forward in depth

    • Lingering questions / feedback?
      I'll include an anonymous form at the end of every lecture: https://bit.ly/3oSr5FD