CS 6354: Pipelining / ISAs
7 September 2016
Review: Memory Hierarchy
Review: Page Tables
Figure: x86-64 four-level page table walk for a 4K memory page. CR3 points to the PML4 table; a PML4 entry points to a page-directory-pointer table, a PDP entry to a page directory, a 64-bit PD entry to a page table, and a 64-bit PT entry to the page. The linear address is split into sign-extended upper bits, four 9-bit table indices, and a 12-bit page offset. Each pointer is 40 bits, aligned to a 4-KByte boundary.
Review: Memory Hierarchy Optimizations
adjust # caches, sizes, associativity, block size, …
adjust when virtual to physical translation happens
add victim caches, prefetching, etc.
cache blocking — reorder code for more reuse
overlap memory accesses and
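As a rough illustration of the cache blocking item above, the sketch below reorders a matrix multiply so that each small tile is reused while it would still be cache-resident. The function name `blocked_matmul` and the tile size are assumptions made for illustration; Python will not show the speedup, only the loop reordering.

```python
def blocked_matmul(a, b, block=2):
    """Multiply square matrices a and b one block x block tile at a time."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):              # tile of rows of a and c
        for kk in range(0, n, block):          # tile of the shared dimension
            for jj in range(0, n, block):      # tile of columns of b and c
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]         # reused across the whole inner loop
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(blocked_matmul(a, b, block=1))
```

The result is the same as the plain triple loop; only the order in which elements are touched changes.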
Human pipeline: laundry
Figure: two laundry timelines from 11:00 to 14:00 across the Washer, Dryer, and Folding Table. The first shows two loads (whites, colors); the second shows three loads (whites, colors, sheets).
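To put numbers on the laundry picture, here is a minimal sketch that compares running loads one after another with overlapping them across stations. It assumes every station takes exactly one hour per load, which is a made-up figure; the function names are hypothetical as well.

```python
def sequential_hours(loads, stages=3, hours_per_stage=1):
    """Each load finishes all stages before the next load starts."""
    return loads * stages * hours_per_stage


def pipelined_hours(loads, stages=3, hours_per_stage=1):
    """A new load enters the first stage as soon as the previous load leaves it."""
    return (stages + loads - 1) * hours_per_stage


for loads in (2, 3):
    print(loads, sequential_hours(loads), pipelined_hours(loads))
```

The time for a single load is unchanged; pipelining only improves how many loads finish per hour.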
The MIPS pipeline
Copyright © 2011, Elsevier Inc. All rights reserved.
Figure C.28 The stall from branch hazards can be reduced by moving the zero test and branch-target calculation into the ID phase of the pipeline. Notice that we have made two important changes, each of which removes 1 cycle from the 3-cycle stall for branches. The first change is to move both the branch-target address calculation and the branch condition decision to the ID cycle. The second change is to write the PC of the instruction in the IF phase, using either the branch-target address computed during ID or the incremented PC computed during IF. In comparison, Figure C.22 obtained the branch-target address from the EX/MEM register and wrote the result during the MEM clock cycle. As mentioned in Figure C.22, the PC can be thought of as a pipeline register (e.g., as part of ID/IF), which is written with the address of the next instruction at the end of each IF cycle.
Figure: H&P Appendix C
MIPS instruction execution (1)
add $1, $2, $3 ; reg[1] <− reg[2] + reg[3]
Instruction Fetch: read from instruction cache
IF/ID stores: instr., PC
Instruction Decode: read registers 2 and 3
ID/EX stores: reg[2], reg[3], instr., PC
Execute: compute reg[2] + reg[3]
EX/MEM stores: reg[2] + reg[3], instr., PC
Memory: do nothing
MEM/WB stores: reg[2] + reg[3], instr., PC
Write Back: write computed value into reg[1]
MIPS instruction execution (2)
sw r1, 100(r3) ; memory[100 + reg[3]] = reg[1]
Instruction Fetch: read from instruction cache
IF/ID stores: instr., PC
Instruction Decode: read registers 1 and 3
ID/EX stores: reg[1], reg[3], instr., PC
Execute: compute 100 + reg[3]
EX/MEM stores: 100 + reg[3], reg[1], instr., PC
Memory: store reg[1] into data @ 100 + reg[3]
MEM/WB stores: instr., PC
Write Back: do nothing
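The two stage-by-stage walks above (add and sw) can be mimicked with a small sketch that pushes one instruction through the five stages and records what each pipeline register would hold. The dictionary-based instruction encoding and the name `run_through_stages` are assumptions for illustration, not part of the slides.

```python
def run_through_stages(instr, regs, mem, pc=0):
    """Push one instruction through IF, ID, EX, MEM, WB and return the latch contents."""
    op = instr["op"]
    if op not in ("add", "sw"):
        raise ValueError("this sketch only models add and sw")
    latches = {}

    # IF: the instruction is passed in directly; a real pipeline reads the instruction cache.
    latches["IF/ID"] = {"instr": instr, "pc": pc}

    # ID: read the two source registers.
    if op == "add":
        first, second = regs[instr["rs"]], regs[instr["rt"]]
    else:
        first, second = regs[instr["base"]], regs[instr["rt"]]  # address base, value to store
    latches["ID/EX"] = {"a": first, "b": second, "instr": instr, "pc": pc}

    # EX: the ALU computes a sum (add) or an address (sw).
    if op == "add":
        alu = first + second
        latches["EX/MEM"] = {"alu": alu, "instr": instr, "pc": pc}
    else:
        alu = instr["offset"] + first
        latches["EX/MEM"] = {"alu": alu, "b": second, "instr": instr, "pc": pc}

    # MEM: only sw touches data memory.
    if op == "sw":
        mem[alu] = second
        latches["MEM/WB"] = {"instr": instr, "pc": pc}
    else:
        latches["MEM/WB"] = {"alu": alu, "instr": instr, "pc": pc}

    # WB: only add writes a register.
    if op == "add":
        regs[instr["rd"]] = alu
    return latches


regs = {1: 0, 2: 5, 3: 7}
mem = {}
run_through_stages({"op": "add", "rd": 1, "rs": 2, "rt": 3}, regs, mem)
run_through_stages({"op": "sw", "rt": 1, "base": 3, "offset": 100}, regs, mem, pc=4)
print(regs[1], mem[107])
```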
MIPS executing
Figure C.3 A pipeline showing the pipeline registers between successive pipeline stages. Notice that the registers prevent interference between two different instructions in adjacent stages in the pipeline. The registers also play the critical role of carrying data for a given instruction from one stage to the other. The edge-triggered property of registers—that is, that the values change instantaneously on a clock edge—is critical. Otherwise, the data from one instruction could interfere with the execution of another!
Pipeline Hazards
hazards stop pipeline from executing at full rate
structural hazards — not enough hardware
data hazards — value not computed soon enough
control hazards — instruction to execute not known soon enough
Structural Hazards
Figure C.4 A processor with only one memory port will generate a conflict whenever a memory reference occurs. In this example the load instruction uses the memory for a data access at the same time instruction 3 wants to fetch an instruction from memory.
Figure: H&P Appendix C
Read-after-Write
add r1, r2, r3 ; r1 <− r2 + r3
sub r4, r1, r5 ; r4 <− r1 − r5

| cycle | add r1, r2, r3 | sub r4, r1, r5 |
|---|---|---|
| 1 | IF | |
| 2 | ID: read r2, r3 | IF |
| 3 | EX: temp1 ← r2 + r3 | ID: read r1, r5 |
| 4 | MEM | EX: temp2 ← r1 - r5 |
| 5 | WB: r1 ← temp1 | MEM |
| 6 | | WB: r4 ← temp2 |
Read-after-Write — Stall
add r1, r2, r3 ; r1 <− r2 + r3
sub r4, r1, r5 ; r4 <− r1 − r5

| cycle | add r1, r2, r3 | sub r4, r1, r5 |
|---|---|---|
| 1 | IF | |
| 2 | ID: read r2, r3 | IF |
| 3 | EX: temp1 ← r2 + r3 | stall |
| 4 | MEM | stall |
| 5 | WB: r1 ← temp1 | stall |
| 6 | | ID: read r1, r5 |
| 7 | | EX: temp2 ← r1 - r5 |
| 8 | | MEM |
| 9 | | WB: r4 ← temp2 |
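The stall timing above can be reproduced with a small sketch. It assumes an in-order five-stage pipeline with no forwarding, where a register written in WB can only be read by an ID in a later cycle (as in the table). The name `schedule_with_stalls` and the tuple encoding of instructions are illustrative assumptions.

```python
def schedule_with_stalls(program):
    """program: list of (dest, sources). Returns one {stage: cycle} dict per instruction."""
    timeline = []
    last_write = {}   # register -> WB cycle of its most recent producer
    prev_id = 1       # ID cycle of the previous instruction (none yet)
    for index, (dest, sources) in enumerate(program):
        fetch = index + 1   # first cycle spent in IF; the instruction may wait there
        ready = max((last_write.get(r, 0) for r in sources), default=0)
        # ID must follow our own IF, the previous instruction's ID, and every producer's WB.
        decode = max(fetch + 1, prev_id + 1, ready + 1)
        stages = {"IF": fetch, "ID": decode, "EX": decode + 1,
                  "MEM": decode + 2, "WB": decode + 3}
        timeline.append(stages)
        if dest is not None:
            last_write[dest] = stages["WB"]
        prev_id = decode
    return timeline


program = [("r1", ["r2", "r3"]),   # add r1, r2, r3
           ("r4", ["r1", "r5"])]   # sub r4, r1, r5
for stages in schedule_with_stalls(program):
    print(stages)
```

With the dependent pair from the slide, the second instruction's ID lands in cycle 6 and its WB in cycle 9.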
Implementing Stalls
disable writing pipeline registers
need logic to detect conflicts
function of pipeline registers (instruction values)
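A minimal sketch of such detection logic, written as a pure function of the pipeline registers. It assumes no forwarding, so the instruction waiting in ID stalls while any older instruction in EX, MEM, or WB still has to write one of its source registers. The latch layout (dicts with `dest` and `sources`, `None` for a bubble) and the name `must_stall` are hypothetical.

```python
def must_stall(if_id, id_ex, ex_mem, mem_wb):
    """Return True if the instruction held in IF/ID has to wait this cycle."""
    if if_id is None:                 # nothing is waiting to be decoded
        return False
    sources = set(if_id["sources"])
    for latch in (id_ex, ex_mem, mem_wb):     # older instructions in EX, MEM, WB
        if latch is not None and latch["dest"] in sources:
            return True
    return False


add = {"dest": "r1", "sources": ["r2", "r3"]}
sub = {"dest": "r4", "sources": ["r1", "r5"]}
print(must_stall(sub, add, None, None))   # add is in EX while sub waits in ID
print(must_stall(sub, None, None, None))  # add has left the pipeline
```

When the function returns True, the hardware would hold the IF/ID register (and the PC) and send a bubble down the pipeline.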
Read-After-Write
Figure C.6 The use of the result of the DADD instruction in the next three instructions causes a hazard, since the register is not written until after those instructions read it.
Figure: H&P Appendix C
Read-after-Write — Forward
add r1, r2, r3 ; r1 <− r2 + r3
sub r4, r1, r5 ; r4 <− r1 − r5

| cycle | add r1, r2, r3 | sub r4, r1, r5 |
|---|---|---|
| 1 | IF | |
| 2 | ID: read r2, r3 | IF |
| 3 | EX: temp1 ← r2 + r3 | ID: read r1, r5 |
| 4 | MEM | EX: temp2 ← temp1 - r5 |
| 5 | WB: r1 ← temp1 | MEM |
| 6 | | WB: r4 ← temp2 |
Forwarding
Figure C.7 A set of instructions that depends on the DADD result uses forwarding paths to avoid the data hazard. The inputs for the DSUB and AND instructions forward from the pipeline registers to the first ALU input. The OR receives its result by forwarding through the register file, which is easily accomplished by reading the registers in the second half of the cycle and writing in the first half, as the dashed lines on the registers indicate. Notice that the forwarded result can go to either ALU input; in fact, both ALU inputs could use forwarded inputs from either the same pipeline register or from different pipeline registers. This would occur, for example, if the AND instruction was AND R6,R1,R4.
Figure: H&P Appendix C
Implementing Forwarding
multiplexers for operand values
need logic to detect which one to use
function of pipeline registers (instruction values)
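The multiplexer selection can be sketched as a function that picks each ALU operand from the newest matching source. This assumes the producers are ALU instructions whose results already sit in the EX/MEM or MEM/WB register; the latch layout and the name `select_operand` are hypothetical.

```python
def select_operand(reg, regfile, ex_mem, mem_wb):
    """Choose the value for source register `reg` at the start of EX."""
    if ex_mem is not None and ex_mem["dest"] == reg:
        return ex_mem["value"]    # newest result, from the instruction one ahead
    if mem_wb is not None and mem_wb["dest"] == reg:
        return mem_wb["value"]    # older result, from the instruction two ahead
    return regfile[reg]           # no hazard: use the value read during ID


regfile = {"r1": 0, "r5": 3}
ex_mem = {"dest": "r1", "value": 12}   # add r1, r2, r3 has just left EX
print(select_operand("r1", regfile, ex_mem, None) - select_operand("r5", regfile, ex_mem, None))
```

Checking EX/MEM before MEM/WB matters when two older instructions both write the same register: the more recent one must win.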
Implementing Forwarding
Figure C.27 Forwarding of results to the ALU requires the addition of three extra inputs on each ALU multiplexer and the addition of three paths to the new inputs. The paths correspond to a bypass of: (1) the ALU output at the end of the EX, (2) the ALU output at the end of the MEM stage, and (3) the memory output at the end of the MEM stage.
Figure: H&P Appendix C
Limits of Forwarding
Figure C.9 The load instruction can bypass its results to the AND and OR instructions, but not to the DSUB, since that would mean forwarding the result in “negative time.”
Figure: H&P Appendix C
Scheduling for Pipelines
lw r1, 0(r20) ; r1 <− MEM[0+r20]
lw r2, 4(r20) ; r2 <− MEM[4+r20]
add r3, r1, r2 ; r3 <− r1 + r2
lw r4, 8(r20) ; r4 <− MEM[8+r20]
add r4, r4, r3 ; r4 <− r4 + r3
sw r4, 8(r20) ; MEM[8+r20] <− r4
lw r5, 12(r20) ; r5 <− MEM[12+r20]
mul r5, r5, r4 ; r5 <− r5 * r4
sw r5, 12(r20) ; MEM[12+r20] <− r5
converts into
lw r1, 0(r20) ; r1 <− MEM[0+r20]
lw r2, 4(r20) ; r2 <− MEM[4+r20]
lw r4, 8(r20) ; r4 <− MEM[8+r20]
lw r5, 12(r20) ; r5 <− MEM[12+r20]
add r3, r1, r2 ; r3 <− r1 + r2
add r4, r4, r3 ; r4 <− r4 + r3
mul r5, r5, r4 ; r5 <− r5 * r4
sw r4, 8(r20) ; MEM[8+r20] <− r4
sw r5, 12(r20) ; MEM[12+r20] <− r5
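To see why the reordered sequence is better, the sketch below counts load-use stalls in both versions. It assumes full forwarding, so the only remaining stall is one cycle when an instruction uses a loaded value immediately after the load, and it treats `mul` as a single-cycle operation (an assumption). The tuple encoding and the name `load_use_stalls` are illustrative.

```python
def load_use_stalls(program):
    """program: list of (op, dest, sources). One stall per load directly followed by a user."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, sources = cur
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls


lw1 = ("lw", "r1", ["r20"])
lw2 = ("lw", "r2", ["r20"])
lw4 = ("lw", "r4", ["r20"])
lw5 = ("lw", "r5", ["r20"])
add3 = ("add", "r3", ["r1", "r2"])
add4 = ("add", "r4", ["r4", "r3"])
mul5 = ("mul", "r5", ["r5", "r4"])
sw4 = ("sw", None, ["r4", "r20"])
sw5 = ("sw", None, ["r5", "r20"])

original = [lw1, lw2, add3, lw4, add4, sw4, lw5, mul5, sw5]
scheduled = [lw1, lw2, lw4, lw5, add3, add4, mul5, sw4, sw5]
print(load_use_stalls(original), load_use_stalls(scheduled))
```

Under these assumptions the original order stalls after three of its loads, while the scheduled order hoists every load far enough ahead of its first use to avoid the stall.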
Next time: Scheduling
Weiss and Smith, “A study of scalar compilation techniques for pipelined supercomputers”
theme: separate dependencies from use
focus on loops
Control Hazard
need to decode instruction to know next instruction
Figure C.3 (pipeline registers between successive pipeline stages), annotated with two points: next instruction known, next instruction needed.
MIPS Delay Slots
avoid control hazard by delaying branch
add $3, $4, $5 ; (1)
beq $1, $2, label ; (2)
add $5, $6, $7 ; (3) DELAY SLOT
add $6, $7, $8
add $8, $9, $10
label:
add $7, $8, $9 ; (4)
Branch Prediction
branch prediction — guess whether branch is taken
start guess immediately
clear pipeline registers if wrong
Speculation
when is it okay to guess?
if we can undo the guess when it is wrong
MIPS pipeline:
IF — doesn’t change state
ID — doesn’t change state
EX — doesn’t change state
MEM — changes memory!
WB — changes registers!
undo: clear pipeline registers before MEM, set new PC
Static branch prediction
forwards not taken (fetch normally)
backwards taken (fetch target)
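The rule above fits in one comparison. A minimal sketch, with a hypothetical function name, that treats a branch to its own address as backward:

```python
def predict_taken_static(branch_pc, target_pc):
    """Predict taken when the target is at or before the branch (a backward branch)."""
    return target_pc <= branch_pc


print(predict_taken_static(0x40, 0x20))   # backward branch, e.g. the bottom of a loop
print(predict_taken_static(0x40, 0x80))   # forward branch, e.g. skipping an if-body
```

The heuristic works because backward branches usually close loops, which are taken on every iteration except the last.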
Dynamic branch prediction
Figure: a prediction table (entries N N N N T T N T) indexed by the low-order bits of the PC, with labels "prediction" and "actual result".
lookup branch address in table
1-bit: Taken/Not taken
taken before ⇒ taken again
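A minimal sketch of the 1-bit scheme: a small table indexed by low-order PC bits, where each entry remembers the last outcome. The table size, the choice to drop the two byte-offset bits of a 4-byte instruction, and all names are assumptions for illustration.

```python
class OneBitPredictor:
    def __init__(self, index_bits=3):
        self.mask = (1 << index_bits) - 1
        self.table = [False] * (1 << index_bits)   # False = not taken

    def _index(self, pc):
        return (pc >> 2) & self.mask   # low-order bits above the byte offset

    def predict(self, pc):
        return self.table[self._index(pc)]

    def update(self, pc, taken):
        self.table[self._index(pc)] = taken   # remember only the last outcome


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken three times and then not taken, executed twice.
loop = [True, True, True, False] * 2
print(count_mispredictions(OneBitPredictor(), 0x400, loop))
```

Each pass through the loop costs two mispredictions: one on the loop exit and one on re-entry, because the exit flipped the stored bit.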
Dynamic branch prediction
refinement: 2 bits
Figure C.18 The states in a 2-bit prediction scheme. By using 2 bits rather than 1, a branch that strongly favors taken or not taken—as many branches do—will be mispredicted less often than with a 1-bit predictor. The 2 bits are used to encode the four states in the system. The 2-bit scheme is actually a specialization of a more general scheme that has an n-bit saturating counter for each entry in the prediction buffer. With an n-bit counter, the counter can take on values between 0 and 2^n – 1: When the counter is greater than or equal to one-half of its maximum value (2^n – 1), the branch is predicted as taken; otherwise, it is predicted as untaken. Studies of n-bit predictors have shown that the 2-bit predictors do almost as well, thus most systems rely on 2-bit branch predictors rather than the more general n-bit predictors.
Figure: H&P Appendix C
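The sketch below follows the caption's description of an n-bit saturating counter per table entry, with n = 2 by default. The state diagram of the figure itself is not reproduced in this transcript, so its exact transitions may differ from a plain saturating counter; the table size, indexing, and names are assumptions.

```python
class SaturatingPredictor:
    def __init__(self, counter_bits=2, index_bits=3):
        self.max_value = (1 << counter_bits) - 1     # 2^n - 1
        self.mask = (1 << index_bits) - 1
        self.table = [0] * (1 << index_bits)         # every counter starts at 0

    def _index(self, pc):
        return (pc >> 2) & self.mask

    def predict(self, pc):
        # Taken when the counter is at least one-half of its maximum value.
        return 2 * self.table[self._index(pc)] >= self.max_value

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(self.table[i] + 1, self.max_value)
        else:
            self.table[i] = max(self.table[i] - 1, 0)


def count_mispredictions(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


# A loop branch taken three times and then not taken, executed four times.
loop = [True, True, True, False] * 4
print(count_mispredictions(SaturatingPredictor(), 0x400, loop))
```

After a short warm-up, a single loop exit no longer flips the prediction, so each later pass through the loop costs one misprediction instead of two.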
Deeper Pipelines (1)
Figure C.35 A pipeline that supports multiple outstanding FP operations. The FP multiplier and adder are fully pipelined and have a depth of seven and four stages, respectively. The FP divider is not pipelined, but requires 24 clock cycles to complete. The latency in instructions between the issue of an FP operation and the use of the result of that operation without incurring a RAW stall is determined by the number of cycles spent in the execution stages. For example, the fourth instruction after an FP add can use the result of the FP add. For integer ALU operations, the depth of the execution pipeline is always one and the next instruction can use the results.
Figure: H&P Appendix C
Deeper Pipelines (2)
Figure C.44 The basic branch delay is 3 cycles, since the condition evaluation is performed during EX.
Figure: H&P Appendix C
Microcoded pipelined CPU
Fewer registers? (1)
Fewer registers? + Separate I-Cache?
RISC factors
Factors favoring MIPS
operand specifier decoding — 1 cycle per specifier on VAX
separate floating point registers — separate FPU
condition code RAW hazards
needless work by, e.g., CISC CALL/RET
filled delay slots
larger page size
larger range for branches
Addressing modes on VAX
ADDL3 @(R5)+[R6], @(R1)+[R2], @(R3)+[R4]
one instruction
six memory accesses, four register reads
three register writes
MEM[MEM[R5]+R6] ← MEM[MEM[R1]+R2] + MEM[MEM[R3]+R4]
R1 ← R1 + 4
R3 ← R3 + 4
R5 ← R5 + 4
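To make the memory-access count concrete, the sketch below executes the register-transfer lines above against a memory that counts its accesses. It mirrors the slide's description only and is not a faithful VAX model; the class and function names are hypothetical.

```python
class CountingMemory:
    def __init__(self, contents):
        self.cells = dict(contents)
        self.accesses = 0

    def load(self, addr):
        self.accesses += 1
        return self.cells[addr]

    def store(self, addr, value):
        self.accesses += 1
        self.cells[addr] = value


def addl3_as_on_slide(regs, mem):
    """MEM[MEM[R5]+R6] <- MEM[MEM[R1]+R2] + MEM[MEM[R3]+R4], then bump R1, R3, R5."""
    a = mem.load(mem.load(regs["R1"]) + regs["R2"])   # two loads
    b = mem.load(mem.load(regs["R3"]) + regs["R4"])   # two loads
    dest = mem.load(regs["R5"]) + regs["R6"]          # one load for the pointer
    mem.store(dest, a + b)                            # one store
    for name in ("R1", "R3", "R5"):
        regs[name] += 4


regs = {"R1": 0, "R2": 1, "R3": 4, "R4": 2, "R5": 8, "R6": 3}
mem = CountingMemory({0: 100, 4: 200, 8: 300, 101: 10, 202: 20})
addl3_as_on_slide(regs, mem)
print(mem.cells[303], mem.accesses)
```

A single instruction doing this much work is hard to pipeline: each memory access can miss or fault, and the register updates must not become visible if a later access fails.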
ISA design
lots of non-technical factors
Notable RISC-V decisions
modular ISA design
optional variable length encoding (code size)
Justifications (1)
31 general-purpose registers + 0 register + pc
usually 32-bit instructions
“it is impossible to encode a complete ISA with 16 registers in 16-bit instructions using a 3-address format. Although a 2-address format would be possible, it would increase instruction count and lower efficiency. … A larger number of integer registers also helps performance on high-performance code,…”
“The optional compressed 16-bit instruction format mostly only accesses 8 registers”
Justifications (2)
“Decoding register specifiers is usually on the critical path … so the instruction format was chosen to keep all register specifiers at the same position…”
Justifications (3)
no delay slots
no condition codes
“condition codes and branch delay slots, which complicate higher performance implementations”