lecture 2: processor and pipelining 1sceweb.uhcl.edu/koch/ceng6332/notes/chapter_3-lecture2.pdf ·...
TRANSCRIPT
CENG 6332
Lecture 2: Processor and Pipelining 1
CENG 6332
Chapter 3 – Additional Slides The Processor and Pipelining
The Simple BIG Picture!
2
Datapath vs Control
3
Datapath: Storage, FU, interconnect sufficient to perform the desired functions Inputs are Control Points Outputs are signals
Controller: State machine to orchestrate operation on the data path Based on desired function and signals
Datapath Controller
Control Points
signals
Introduction CPU performance factors Instruction count
Determined by ISA and compiler CPI and Cycle time
Determined by CPU hardware
We will examine two MIPS implementations A simplified version A more realistic pipelined version
Simple subset, shows most aspects Memory reference: lw, sw Arithmetic/logical: add, sub, and, or, slt Control transfer: beq, j
Introduction
4
CENG 6332
Lecture 2: Processor and Pipelining 2
A "Typical" RISC ISA
5
32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair) 3-address, reg-reg arithmetic instruction Single address mode for load/store:
base + displacement no indirection
Simple branch conditions Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
Example: MIPS ( MIPS)
6
Op31 26 01516202125
Rs1 Rd immediate
Op31 26 025
Op31 26 01516202125
Rs1 Rs2
target
Rd Opx
Register-Register561011
Register-Immediate
Op31 26 01516202125
Rs1 Rs2/Opx immediate
Branch
Jump / Call
Instruction Execution PC instruction memory, fetch instruction Register numbers register file, read registers Depending on instruction class Use ALU to calculate
Arithmetic result Memory address for load/store Branch target address
Access data memory for load/store PC target address or PC + 4
7
CPU Overview
8
CENG 6332
Lecture 2: Processor and Pipelining 3
Control
10
Building a Datapath Datapath Elements that process data and addresses
in the CPU Registers, ALUs, mux’s, memories, …
We will build a MIPS datapath incrementally Refining the overview design
Building a D
atapath
11
Instruction Fetch
32-bit register
Increment by 4 for next instruction
12
R-Format Instructions Read two register operands Perform arithmetic/logical operation Write register result
13
CENG 6332
Lecture 2: Processor and Pipelining 4
Load/Store Instructions Read register operands Calculate address using 16-bit offset Use ALU, but sign-extend offset
Load: Read memory and update register Store: Write register value to memory
14
Branch Instructions Read register operands Compare operands Use ALU, subtract and check Zero output
Calculate target address Sign-extend displacement Shift left 2 places (word displacement) Add to PC + 4
Already calculated by instruction fetch
15
Branch Instructions
Justre-routes
wires
Sign-bit wire replicated
16
Composing the Elements First-cut data path does an instruction in one clock cycle Each datapath element can only do one function at a time Hence, we need separate instruction and data memories
Use multiplexers where alternate data sources are used for different instructions
17
CENG 6332
Lecture 2: Processor and Pipelining 5
R-Type/Load/Store Datapath
18
Full Datapath
19
ALU Control ALU used for Load/Store: F = add Branch: F = subtract R-type: F depends on funct field
A Sim
ple Implem
entation Schem
eALU control Function
0000 AND0001 OR0010 add0110 subtract0111 set-on-less-than1100 NOR
20
ALU Control Assume 2-bit ALUOp derived from opcode Combinational logic derives ALU control
opcode ALUOp Operation funct ALU function ALU controllw 00 load word XXXXXX add 0010sw 00 store word XXXXXX add 0010beq 01 branch equal XXXXXX subtract 0110R-type 10 add 100000 add 0010
subtract 100010 subtract 0110AND 100100 AND 0000OR 100101 OR 0001set-on-less-than 101010 set-on-less-than 0111
21
CENG 6332
Lecture 2: Processor and Pipelining 6
The Main Control Unit Control signals derived from instruction
0 rs rt rd shamt funct31:26 5:025:21 20:16 15:11 10:6
35 or 43 rs rt address31:26 25:21 20:16 15:0
4 rs rt address31:26 25:21 20:16 15:0
R-type
Load/Store
Branch
opcode always read
read, except for load
write for R-type
and load
sign-extend and add
22
Datapath With Control
23
R-Type Instruction
24
Load Instruction
25
CENG 6332
Lecture 2: Processor and Pipelining 7
Branch-on-Equal Instruction
26
Implementing Jumps
Jump uses word address Update PC with concatenation of Top 4 bits of old PC 26-bit jump address 00
Need an extra control signal decoded from opcode
2 address31:26 25:0
Jump
27
Datapath With Jumps Added
28
Performance Issues Longest delay determines clock period Critical path: load instruction Instruction memory register file ALU data memory register file
Not feasible to vary period for different instructions Violates design principle Making the common case fast
We will improve performance by pipelining
29
CENG 6332
Lecture 2: Processor and Pipelining 8
Pipelining Analogy Pipelined laundry: overlapping execution Parallelism improves performance
An O
verview of P
ipelining
Four loads: Speedup
= 8/3.5 = 2.3 Non-stop:
Speedup= 2n/0.5n + 1.5 ≈ 4= number of stages
30
MIPS Pipeline Five stages, one step per stage
1. IF: Instruction fetch from memory2. ID: Instruction decode & register read3. EX: Execute operation or calculate address4. MEM: Access memory operand5. WB: Write result back to register
31
Pipeline Performance Assume time for stages is 100ps for register read or write 200ps for other stages
Compare pipelined datapath with single-cycle datapath
Instr Instr fetch Register read
ALU op Memory access
Register write
Total time
lw 200ps 100 ps 200ps 200ps 100 ps 800ps
sw 200ps 100 ps 200ps 200ps 700ps
R-format 200ps 100 ps 200ps 100 ps 600ps
beq 200ps 100 ps 200ps 500ps
32
Pipeline PerformanceSingle-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
33
CENG 6332
Lecture 2: Processor and Pipelining 9
Pipeline Speedup If all stages are balanced i.e., all take the same time Time between instructionspipelined
If not balanced, speedup is less Speedup due to increased throughput Latency (time for each instruction) does not decrease
34
Stages ofNumber nsInstructioBetween Time ednonpipelin
Hazards Situations that prevent starting the next instruction in the
next cycle Structure hazards
A required resource is busy Data hazard
Need to wait for previous instruction to complete its data read/write Control hazard
Deciding on control action depends on previous instruction
36
Structure Hazards Conflict for use of a resource In MIPS pipeline with a single memory Load/store requires data access Instruction fetch would have to stall for that cycle
Would cause a pipeline “bubble”
Hence, pipelined datapaths require separate instruction/data memories Or separate instruction/data caches
37
Data Hazards An instruction depends on completion of data access by a
previous instruction add $s0, $t0, $t1sub $t2, $s0, $t3
38
CENG 6332
Lecture 2: Processor and Pipelining 10
Forwarding (aka Bypassing) Use result when it is computed Don’t wait for it to be stored in a register Requires extra connections in the datapath
39
Load-Use Data Hazard Can’t always avoid stalls by forwarding If value not computed when needed Can’t forward backward in time!
40
Code Scheduling to Avoid Stalls Reorder code to avoid use of load result in the next
instruction C code for A = B + E; C = B + F;
lw $t1, 0($t0)
lw $t2, 4($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
lw $t4, 8($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
stall
stall
lw $t1, 0($t0)
lw $t2, 4($t0)
lw $t4, 8($t0)
add $t3, $t1, $t2
sw $t3, 12($t0)
add $t5, $t1, $t4
sw $t5, 16($t0)
11 cycles13 cycles
41
Control Hazards Branch determines flow of control Fetching next instruction depends on branch outcome Pipeline can’t always fetch correct instruction
Still working on ID stage of branch
In MIPS pipeline Need to compare registers and compute target early in the
pipeline Add hardware to do it in ID stage
42
CENG 6332
Lecture 2: Processor and Pipelining 11
Stall on Branch Wait until branch outcome determined before fetching
next instruction
43
Branch Prediction Longer pipelines can’t readily determine branch outcome
early Stall penalty becomes unacceptable
Predict outcome of branch Only stall if prediction is wrong
In MIPS pipeline Can predict branches not taken Fetch instruction after branch, with no delay
44
MIPS with Predict Not Taken
Prediction correct
Prediction incorrect
45
More-Realistic Branch Prediction Static branch prediction Based on typical branch behavior Example: loop and if-statement branches
Predict backward branches taken Predict forward branches not taken
Dynamic branch prediction Hardware measures actual branch behavior
e.g., record recent history of each branch
Assume future behavior will continue the trend When wrong, stall while re-fetching, and update history
46
CENG 6332
Lecture 2: Processor and Pipelining 12
Pipeline Summary
Pipelining improves performance by increasing instruction throughput Executes multiple instructions in parallel Each instruction has the same latency
Subject to hazards Structure, data, control
Instruction set design affects complexity of pipeline implementation
The BIG Picture
47
MIPS Pipelined DatapathP
ipelined Datapath and C
ontrol
WB
MEM
Right-to-left flow leads to hazards
48
Pipeline registers Need registers between stages To hold information produced in previous cycle
49
Pipeline Operation Cycle-by-cycle flow of instructions through the pipelined
datapath “Single-clock-cycle” pipeline diagram
Shows pipeline usage in a single cycle Highlight resources used
c.f. “multi-clock-cycle” diagram Graph of operation over time
We’ll look at “single-clock-cycle” diagrams for load & store
50
CENG 6332
Lecture 2: Processor and Pipelining 13
IF for Load, Store, …
51
ID for Load, Store, …
52
EX for Load
53
MEM for Load
54
CENG 6332
Lecture 2: Processor and Pipelining 14
WB for Load
Wrongregisternumber
55
Corrected Datapath for Load
56
EX for Store
57
MEM for Store
58
CENG 6332
Lecture 2: Processor and Pipelining 15
WB for Store
59
Multi-Cycle Pipeline Diagram Form showing resource usage
60
Multi-Cycle Pipeline Diagram Traditional form
61
Single-Cycle Pipeline Diagram State of pipeline in a given cycle
62
CENG 6332
Lecture 2: Processor and Pipelining 16
Pipelined Control (Simplified)
63
Pipelined Control Control signals derived from instruction As in single-cycle implementation
64
Pipelined Control
65
Data Hazards in ALU Instructions Consider this sequence:
sub $2, $1,$3and $12,$2,$5or $13,$6,$2add $14,$2,$2sw $15,100($2)
We can resolve hazards with forwarding How do we detect when to forward?
Data H
azards: Forwarding vs. S
talling
66
CENG 6332
Lecture 2: Processor and Pipelining 17
Dependencies & Forwarding
67
Forwarding Paths
68
Double Data Hazard Consider the sequence:
add $1,$1,$2add $1,$1,$3add $1,$1,$4
Both hazards occur Want to use the most recent
Revise MEM hazard condition Only fwd if EX hazard condition isn’t true
69
Datapath with Forwarding
70
CENG 6332
Lecture 2: Processor and Pipelining 18
Load-Use Data Hazard
Need to stall for one cycle
71
How to Stall the Pipeline Force control values in ID/EX register
to 0 EX, MEM and WB do nop (no-operation)
Prevent update of PC and IF/ID register Using instruction is decoded again Following instruction is fetched again 1-cycle stall allows MEM to read data for lw
Can subsequently forward to EX stage
72
Stall/Bubble in the Pipeline
Stall inserted here
73
Data Hazard on R1Figure A.6, Page A-17
75
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Time (clock cycles)
IF ID/RF EX MEM WB
CENG 6332
Lecture 2: Processor and Pipelining 19
Three Generic Data Hazards
76
1. Read After Write (RAW)InstrJ tries to read operand before InstrI writes it
Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
I: add r1,r2,r3J: sub r4,r1,r3
Three Generic Data Hazards
77
2. Write After Read (WAR)InstrJ writes operand before InstrI reads it
Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
3. Write After Write (WAW)InstrJ writes operand before InstrI writes it.
Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”.
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5
Will see WAR and WAW in more complicated pipes
Three Generic Data Hazards
78
I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7
Datapath with Hazard Detection
79
CENG 6332
Lecture 2: Processor and Pipelining 20
Stalls and Performance
Stalls reduce performance But are required to get correct results
Compiler can arrange code to avoid hazards and stalls Requires knowledge of the pipeline structure
The BIG Picture
80
Branch Hazards If branch outcome determined in MEM
Control H
azards
PC
Flush theseinstructions(Set controlvalues to 0)
81
Reducing Branch Delay Move hardware to determine outcome to ID stage Target address adder Register comparator So, branch address is resolved in ID stage
Example: branch taken36: sub $10, $4, $840: beq $1, $3, 744: and $12, $2, $548: or $13, $2, $652: add $14, $4, $256: slt $15, $6, $7
...72: lw $4, 50($7)
82
Data Hazards for Branches If a comparison register is a destination of 2nd or 3rd
preceding ALU instruction
…
IF ID EX MEM WB
IF ID EX MEM WB
IF ID EX MEM WB
IF ID EX MEM WB
add $4, $5, $6
add $1, $2, $3
beq $1, $4, target
Can resolve using forwarding
85
CENG 6332
Lecture 2: Processor and Pipelining 21
Data Hazards for Branches If a comparison register is a destination of preceding ALU
instruction or 2nd preceding load instruction Need 1 stall cycle
beq stalled
IF ID EX MEM WB
IF ID EX MEM WB
IF ID
ID EX MEM WB
add $4, $5, $6
lw $1, addr
beq $1, $4, target
86
Data Hazards for Branches If a comparison register is a destination of immediately
preceding load instruction Need 2 stall cycles
beq stalled
IF ID EX MEM WB
IF ID
ID
ID EX MEM WB
beq stalled
lw $1, addr
beq $1, $0, target
87
Four Branch Hazard Alternatives1. Stall until branch direction is clear2. Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction
3. Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome
88
Four Branch Hazard Alternatives4. Delayed Branch Define branch to take place AFTER a following instruction
branch instructionsequential successor1sequential successor2........sequential successorn
branch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this
89
Branch delay of length n
CENG 6332
Lecture 2: Processor and Pipelining 22
Scheduling Branch Delay Slots (Fig A.14)
90
A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails
add $1,$2,$3if $2=0 then
delay slot
A. From before branch B. From branch target C. From fall through
add $1,$2,$3if $1=0 then
delay slot
add $1,$2,$3if $1=0 then
delay slot
sub $4,$5,$6
sub $4,$5,$6
becomes becomes becomes
if $2=0 then
add $1,$2,$3add $1,$2,$3if $1=0 thensub $4,$5,$6
add $1,$2,$3if $1=0 then
sub $4,$5,$6
Delayed Branch
91
Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in
computation About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches Growth in available transistors has made dynamic approaches
relatively cheaper
Evaluating Branch Alternatives
92
Assume 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken
Scheduling Branch CPI speedup v. speedup v.scheme penalty unpipelined stall
Stall pipeline 3 1.60 3.1 1.0Predict taken 1 1.20 4.2 1.33Predict not taken 1 1.14 4.4 1.40Delayed branch 0.5 1.10 4.5 1.45
Pipeline speedup = Pipeline depth1 +Branch frequencyBranch penalty
Dynamic Branch Prediction In deeper and superscalar pipelines, branch penalty is
more significant Use dynamic prediction Branch prediction buffer (aka branch history table) Indexed by recent branch instruction addresses Stores outcome (taken/not taken) To execute a branch
Check table, expect the same outcome Start fetching from fall-through or target If wrong, flush pipeline and flip prediction
93
CENG 6332
Lecture 2: Processor and Pipelining 23
1-Bit Predictor: Shortcoming Inner loop branches mispredicted twice!
outer: ……
inner: ……beq …, …, inner…beq …, …, outer
Mispredict as taken on last iteration of inner loop
Then mispredict as not taken on first iteration of inner loop next time around
94
2-Bit Predictor Only change prediction on two successive mispredictions
95
Calculating the Branch Target Even with predictor, still need to calculate the target
address 1-cycle penalty for a taken branch
Branch target buffer Cache of target addresses Indexed by PC when instruction fetched
If hit and instruction is branch predicted taken, can fetch target immediately
96
Exceptions and Interrupts “Unexpected” events requiring change
in flow of control Different ISAs use the terms differently
Exception Arises within the CPU
e.g., undefined opcode, overflow, syscall, …
Interrupt From an external I/O controller
Dealing with them without sacrificing performance is hard
Exceptions
97
CENG 6332
Lecture 2: Processor and Pipelining 24
Instruction-Level Parallelism (ILP) Pipelining: executing multiple instructions in parallel To increase ILP Deeper pipeline
Less work per stage shorter clock cycle Multiple issue
Replicate pipeline stages multiple pipelines Start multiple instructions per clock cycle CPI < 1, so use Instructions Per Cycle (IPC) E.g., 4GHz 4-way multiple-issue 16 BIPS, peak CPI = 0.25, peak IPC = 4
But dependencies reduce this in practice
Parallelism
and Advanced Instruction Level P
arallelism
109
Multiple Issue Static multiple issue Compiler groups instructions to be issued together Packages them into “issue slots” Compiler detects and avoids hazards
Dynamic multiple issue CPU examines instruction stream and chooses instructions
to issue each cycle Compiler can help by reordering instructions CPU resolves hazards using advanced techniques at runtime
110
Speculation “Guess” what to do with an instruction Start operation as soon as possible Check whether guess was right
If so, complete the operation If not, roll-back and do the right thing
Common to static and dynamic multiple issue Examples Speculate on branch outcome
Roll back if path taken is different
Speculate on load Roll back if location is updated
111
Compiler/Hardware Speculation Compiler can reorder instructions e.g., move load before branch Can include “fix-up” instructions to recover from incorrect
guess
Hardware can look ahead for instructions to execute Buffer results until it determines they are actually needed Flush buffers on incorrect speculation
112
CENG 6332
Lecture 2: Processor and Pipelining 25
Static Multiple Issue Compiler groups instructions into “issue packets” Group of instructions that can be issued on a single cycle Determined by pipeline resources required
Think of an issue packet as a very long instruction Specifies multiple concurrent operations Very Long Instruction Word (VLIW)
114
Scheduling Static Multiple Issue Compiler must remove some/all hazards Reorder instructions into issue packets No dependencies with a packet Possibly some dependencies between packets
Varies between ISAs; compiler must know!
Pad with nop if necessary
115
MIPS with Static Dual Issue Two-issue packets One ALU/branch instruction One load/store instruction 64-bit aligned
ALU/branch, then load/store Pad an unused instruction with nop
Address Instruction type Pipeline Stages
n ALU/branch IF ID EX MEM WB
n + 4 Load/store IF ID EX MEM WB
n + 8 ALU/branch IF ID EX MEM WB
n + 12 Load/store IF ID EX MEM WB
n + 16 ALU/branch IF ID EX MEM WB
n + 20 Load/store IF ID EX MEM WB
116
MIPS with Static Dual Issue
117
CENG 6332
Lecture 2: Processor and Pipelining 26
Hazards in the Dual-Issue MIPS More instructions executing in parallel EX data hazard Forwarding avoided stalls with single-issue Now can’t use ALU result in load/store in same packet
add $t0, $s0, $s1load $s2, 0($t0)
Split into two packets, effectively a stall
Load-use hazard Still one cycle use latency, but now two instructions
More aggressive scheduling required
118
Scheduling Example Schedule this for dual-issue MIPS
Loop: lw $t0, 0($s1) # $t0=array elementaddu $t0, $t0, $s2 # add scalar in $s2sw $t0, 0($s1) # store resultaddi $s1, $s1,–4 # decrement pointerbne $s1, $zero, Loop # branch $s1!=0
ALU/branch Load/store cycleLoop: nop lw $t0, 0($s1) 1
addi $s1, $s1,–4 nop 2
addu $t0, $t0, $s2 nop 3
bne $s1, $zero, Loop sw $t0, 4($s1) 4
IPC = 5/4 = 1.25 (c.f. peak IPC = 2)119
Loop Unrolling Replicate loop body to expose more parallelism Reduces loop-control overhead
Use different registers per replication Called “register renaming” Avoid loop-carried “anti-dependencies”
Store followed by a load of the same register Aka “name dependence” Reuse of a register name
120
Loop Unrolling Example
IPC = 14/8 = 1.75 Closer to 2, but at cost of registers and code size
ALU/branch Load/store cycleLoop: addi $s1, $s1,–16 lw $t0, 0($s1) 1
nop lw $t1, 12($s1) 2
addu $t0, $t0, $s2 lw $t2, 8($s1) 3
addu $t1, $t1, $s2 lw $t3, 4($s1) 4
addu $t2, $t2, $s2 sw $t0, 16($s1) 5
addu $t3, $t4, $s2 sw $t1, 12($s1) 6
nop sw $t2, 8($s1) 7
bne $s1, $zero, Loop sw $t3, 4($s1) 8
121
CENG 6332
Lecture 2: Processor and Pipelining 27
Dynamic Multiple Issue “Superscalar” processors CPU decides whether to issue 0, 1, 2, … each cycle Avoiding structural and data hazards
Avoids the need for compiler scheduling Though it may still help Code semantics ensured by the CPU
122
Dynamic Pipeline Scheduling Allow the CPU to execute instructions out of order to
avoid stalls But commit result to registers in order
Examplelw $t0, 20($s2)addu $t1, $t0, $t2sub $s4, $s4, $t3slti $t5, $s4, 20
Can start sub while addu is waiting for lw
123
Dynamically Scheduled CPU
Results also sent to any waiting
reservation stations
Reorders buffer for register writes Can supply
operands for issued instructions
Preserves dependencies
Hold pending operands
124
Register Renaming Reservation stations and reorder buffer effectively
provide register renaming On instruction issue to reservation station If operand is available in register file or reorder buffer
Copied to reservation station No longer required in the register; can be overwritten
If operand is not yet available It will be provided to the reservation station by a function unit Register update may not be required
125
CENG 6332
Lecture 2: Processor and Pipelining 28
Speculation Predict branch and continue issuing Don’t commit until branch outcome determined
Load speculation Avoid load and cache miss delay
Predict the effective address Predict loaded value Load before completing outstanding stores Bypass stored values to load unit
Don’t commit load until speculation cleared
126
Why Do Dynamic Scheduling? Why not just let the compiler schedule code? Not all stalls are predicable e.g., cache misses
Can’t always schedule around branches Branch outcome is dynamically determined
Different implementations of an ISA have different latencies and hazards
127
Does Multiple Issue Work?
Yes, but not as much as we’d like Programs have real dependencies that limit ILP Some dependencies are hard to eliminate e.g., pointer aliasing
Some parallelism is hard to expose Limited window size during instruction issue
Memory delays and limited bandwidth Hard to keep pipelines full
Speculation can help if done well
The BIG Picture
128
Fallacies Pipelining is easy (!) The basic idea is easy The devil is in the details
e.g., detecting data hazards
Pipelining is independent of technology So why haven’t we always done pipelining? More transistors make more advanced techniques feasible Pipeline-related ISA design needs to take account of
technology trends e.g., predicated instructions
Fallacies and Pitfalls
132
CENG 6332
Lecture 2: Processor and Pipelining 29
Pitfalls Poor ISA design can make pipelining harder e.g., complex instruction sets (VAX, IA-32)
Significant overhead to make pipelining work IA-32 micro-op approach
e.g., complex addressing modes Register update side effects, memory indirection
e.g., delayed branches Advanced pipelines have long delay slots
133
Concluding Remarks ISA influences design of datapath and control Datapath and control influence design of ISA Pipelining improves instruction throughput
using parallelism More instructions completed per second Latency for each instruction not reduced
Hazards: structural, data, control Multiple issue and dynamic scheduling (ILP) Dependencies limit achievable parallelism Complexity leads to the power wall
Concluding R
emarks
134
Summary of Pipelining
135
MIPS – An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion
A "Typical" RISC ISA
136
32-bit fixed format instruction (3 formats) 32 32-bit GPR (R0 contains zero, DP take pair) 3-address, reg-reg arithmetic instruction Single address mode for load/store:
base + displacement no indirection
Simple branch conditions Delayed branch
see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC,CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
CENG 6332
Lecture 2: Processor and Pipelining 30
Example: MIPS ( MIPS)
137
Op31 26 01516202125
Rs1 Rd immediate
Op31 26 025
Op31 26 01516202125
Rs1 Rs2
target
Rd Opx
Register-Register561011
Register-Immediate
Op31 26 01516202125
Rs1 Rs2/Opx immediate
Branch
Jump / Call
Datapath vs Control
138
Datapath: Storage, FU, interconnect sufficient to perform the desired functions Inputs are Control Points Outputs are signals
Controller: State machine to orchestrate operation on the data path Based on desired function and signals
Datapath Controller
Control Points
signals
5 Steps of MIPS DatapathFigure A.3, Page A-9
140
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
XM
UX
Data
Mem
ory
MU
X
SignExtend
Zero?
IF/ID
ID/EX
MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC Next SEQ PC
RD RD RD WB
Dat
a
Next PC
Address
RS1
RS2
Imm
MU
X
IR <= mem[PC]; PC <= PC + 4
A <= Reg[IRrs]; B <= Reg[IRrt]
rslt <= A opIRop B
Reg[IRrd] <= WB
WB <= rslt
Inst. Set Processor Controller
141
IR <= mem[PC]; PC <= PC + 4
A <= Reg[IRrs]; B <= Reg[IRrt]
r <= A opIRop B
Reg[IRrd] <= WB
WB <= r
Ifetch
opFetch-DCD
PC <= IRjaddrif bop(A,b)PC <= PC+IRim
br jmpRR
r <= A opIRop IRim
Reg[IRrd] <= WB
WB <= r
RIr <= A + IRim
WB <= Mem[r]
Reg[IRrd] <= WB
LD
STJSR
JR
CENG 6332
Lecture 2: Processor and Pipelining 31
Visualizing PipeliningFigure A.2, Page A-8
142
Instr.
Order
Time (clock cycles)
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Pipelining is not quite that easy!
143
Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of
instructions (single person to fold and put clothes away) Data hazards: Instruction depends on result of prior instruction
still in the pipeline (missing sock) Control hazards: Caused by delay between the fetching of
instructions and decisions about changes in control flow (branches and jumps).
One Memory Port/Structural HazardsFigure A.4, Page A-14
144
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Instr 3
Instr 4
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg ALU DMemIfetch Reg
One Memory Port/Structural Hazards(Similar to Figure A.5, Page A-15)
145
Instr.
Order
Time (clock cycles)
Load
Instr 1
Instr 2
Stall
Instr 3
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 6 Cycle 7Cycle 5
Reg ALU DMemIfetch Reg
Bubble Bubble Bubble BubbleBubble
How do you “bubble” the pipe?
CENG 6332
Lecture 2: Processor and Pipelining 32
Speed Up Equation for Pipelining
146
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline CPI Ideal
depth Pipeline CPI Ideal Speedup
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline 1
depth Pipeline Speedup
Instper cycles Stall Average CPI Ideal CPIpipelined
For simple RISC pipeline, CPI = 1:
Example: Dual-port vs. Single-port
147
Machine A: Dual ported memory (“Harvard Architecture”) Machine B: Single ported memory, but its pipelined
implementation has a 1.05 times faster clock rate Ideal CPI = 1 for both Loads are 40% of instructions executed
SpeedUpA = Pipeline Depth/(1 + 0) x (clockunpipe/clockpipe)
= Pipeline Depth
SpeedUpB = Pipeline Depth/(1 + 0.4 x 1) x (clockunpipe/(clockunpipe / 1.05)
= (Pipeline Depth/1.4) x 1.05
= 0.75 x Pipeline Depth
SpeedUpA / SpeedUpB = Pipeline Depth/(0.75 x Pipeline Depth) = 1.33
Machine A is 1.33 times faster
Data Hazard on R1Figure A.6, Page A-17
148
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Time (clock cycles)
IF ID/RF EX MEM WB
Three Generic Data Hazards
149
1. Read After Write (RAW)InstrJ tries to read operand before InstrI writes it
Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
I: add r1,r2,r3J: sub r4,r1,r3
CENG 6332
Lecture 2: Processor and Pipelining 33
Three Generic Data Hazards
150
2. Write After Read (WAR)InstrJ writes operand before InstrI reads it
Called an “anti-dependence” by compiler writers.This results from reuse of the name “r1”.
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5
I: sub r4,r1,r3 J: add r1,r2,r3K: mul r6,r1,r7
3. Write After Write (WAW)InstrJ writes operand before InstrI writes it.
Called an “output dependence” by compiler writersThis also results from the reuse of name “r1”.
Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5
Will see WAR and WAW in more complicated pipes
Three Generic Data Hazards
151
I: sub r1,r4,r3 J: add r1,r2,r3K: mul r6,r1,r7
Forwarding to Avoid Data HazardFigure A.7, Page A-19
152
Time (clock cycles)
Instr.
Order
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
HW Change for ForwardingFigure A.23, Page A-37
153
MEM
/WR
ID/EX
EX/M
EM
DataMemory
ALU
mux
mux
Registers
NextPC
Immediate
mux
What circuit detects and resolves this hazard?
CENG 6332
Lecture 2: Processor and Pipelining 34
Forwarding to Avoid LW-SW Data HazardFigure A.8, Page A-20
154
Time (clock cycles)
Instr.
Order
add r1,r2,r3
lw r4, 0(r1)
sw r4,12(r1)
or r8,r6,r9
xor r10,r9,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Data Hazard Even with ForwardingFigure A.9, Page A-21
155
Time (clock cycles)
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Data Hazard Even with Forwarding(Similar to Figure A.10, Page A-21)
156
Time (clock cycles)
or r8,r1,r9
Instr.
Order
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
Reg ALU DMemIfetch Reg
RegIfetch ALU DMem RegBubble
Ifetch ALU DMem RegBubble Reg
Ifetch ALU DMemBubble Reg
How is this detected?
Try producing fast code fora = b + c;d = e – f;
assuming a, b, c, d ,e, and f in memory. Slow code:
LW Rb,bLW Rc,cADD Ra,Rb,RcSW a,Ra LW Re,e LW Rf,fSUB Rd,Re,RfSW d,Rd
Software Scheduling to Avoid Load Hazards
157
Fast code:LW Rb,bLW Rc,cLW Re,eADD Ra,Rb,RcLW Rf,fSW a,RaSUB Rd,Re,RfSW d,Rd
Compiler optimizes for performance. Hardware checks for safety.
CENG 6332
Lecture 2: Processor and Pipelining 35
Outline
158
MIPS – An ISA for Pipelining 5 stage pipelining Structural and Data Hazards Forwarding Branch Schemes Exceptions and Interrupts Conclusion
Control Hazard on BranchesThree Stage Stall
159
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch Reg
Reg ALU DMemIfetch
What do you do with the 3 instructions in between?How do you do it?
Where is the “commit”?
Branch Stall Impact
160
If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier
MIPS branch tests if register = 0 or 0 MIPS Solution:
Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3
Pipelined MIPS DatapathFigure A.24, page A-38
161
Adder
IF/ID
MemoryAccess
WriteBack
InstructionFetch
Instr. DecodeReg. Fetch
ExecuteAddr. Calc
ALU
Mem
ory
Reg File
MU
X
Data
Mem
ory
MU
X
SignExtend
Zero?
MEM
/WB
EX/M
EM
4
Adder
Next SEQ PC
RD RD RD WB
Dat
a
• Interplay of instruction set design and cycle time.
Next PC
Address
RS1
RS2
Imm
MU
X
ID/EX
CENG 6332
Lecture 2: Processor and Pipelining 36
Four Branch Hazard Alternatives1. Stall until branch direction is clear2. Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction
3. Predict Branch Taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penalty Other machines: branch target known before outcome
162
Four Branch Hazard Alternatives4. Delayed Branch Define branch to take place AFTER a following instruction
branch instructionsequential successor1sequential successor2........sequential successorn
branch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
MIPS uses this
163
Branch delay of length n
Scheduling Branch Delay Slots (Fig A.14)
164
A is the best choice, fills delay slot & reduces instruction count (IC) In B, the sub instruction may need to be copied, increasing IC In B and C, must be okay to execute sub when branch fails
add $1,$2,$3if $2=0 then
delay slot
A. From before branch B. From branch target C. From fall through
add $1,$2,$3if $1=0 then
delay slot
add $1,$2,$3if $1=0 then
delay slot
sub $4,$5,$6
sub $4,$5,$6
becomes becomes becomes
if $2=0 then
add $1,$2,$3add $1,$2,$3if $1=0 thensub $4,$5,$6
add $1,$2,$3if $1=0 then
sub $4,$5,$6
Delayed Branch
165
Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in
computation About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: As processor go to deeper pipelines and multiple issue, the branch delay grows and need more than one delay slot Delayed branching has lost popularity compared to more
expensive but more flexible dynamic approaches Growth in available transistors has made dynamic approaches
relatively cheaper
CENG 6332
Lecture 2: Processor and Pipelining 37
Evaluating Branch Alternatives
166
Assume 4% unconditional branch, 6% conditional branch- untaken, 10% conditional branch-taken
Scheduling Branch CPI speedup v. speedup v.scheme penalty unpipelined stall
Stall pipeline 3 1.60 3.1 1.0Predict taken 1 1.20 4.2 1.33Predict not taken 1 1.14 4.4 1.40Delayed branch 0.5 1.10 4.5 1.45
Pipeline speedup = Pipeline depth1 +Branch frequencyBranch penalty
Problems with Pipelining
167
Exception: An unusual event happens to an instruction during its execution Examples: divide by zero, undefined opcode
Interrupt: Hardware signal to switch the processor to a new instruction stream Example: a sound card interrupts when it needs more audio
output samples (an audio “click” happens if it is left waiting) Problem: It must appear that the exception or interrupt
must appear between 2 instructions (Ii and Ii+1) The effect of all instructions up to and including Ii is totalling
complete No effect of any instruction after Ii can take place
The interrupt (exception) handler either aborts program or restarts at instruction Ii+1
In Conclusion: Control and Pipelining
170
Just overlap tasks; easy if tasks are independent Speed Up Pipeline Depth; if ideal CPI is 1, then:
Hazards limit performance on computers: Structural: need more HW resources Data (RAW,WAR,WAW): need forwarding, compiler scheduling Control: delayed branch, prediction
Exceptions, Interrupts add complexity
pipelined
dunpipeline
TimeCycle TimeCycle
CPI stall Pipeline 1
depth Pipeline Speedup