Post on 21-Dec-2015
CSCI 620 1
Order of Class Lectures
• Chapter 2 starts with
Instruction-Level Parallelism: Concepts and Challenges
Instruction Level Parallelism (ILP)
• Definition: Potential to overlap the execution of instructions
– Pipelining is one example
– Limitations of ILP are from data and control hazards
• Approaches to overcoming limitations
– dynamic approaches with hardware
– static approaches that use software
• So, we will cover ISA (Appendix B) and Pipelining (Appendix A) first, then Chapter 2
CSCI 620 2
Instruction Set Principles and Examples
(Appendix B)
CSCI 620 3
Review: Instruction Set Design Parameters
• Operand storage in the CPU: Where are operands kept other than in memory? Registers
• Number of explicit operands named per instruction: How many operands are named explicitly in a typical instruction? 0 to 3
• Operand location: Can any ALU operand be located in memory, or must some or all of the operands be in internal storage in the CPU? If an operand is located in memory, how is the memory location specified? Most popular addressing modes: Displacement, Immediate, Register Indirect
• Operations: What operations are provided in the instruction set?
Most often used are: Arithmetic & Logic, Data transfer, Control
• Type and size of operands: What is the type and size of each operand and how is it specified? 8, 16, 32, 64 bits
CSCI 620 4
Current Design Guidelines
• Use general-purpose registers with a load-store architecture
• Support these addressing modes: displacement, immediate, and register indirect
• Use a minimalist instruction set
• Support simple, most-commonly used instructions
• Support standard data sizes and types: 8-, 16-, and 32-bit integers and 64-bit IEEE 754 floating-point numbers
• Use fixed instruction encoding if interested in performance and variable instruction encoding if interested in code size
• Provide at least 16 general-purpose registers plus separate floating-point registers; 32 registers of each highly desirable
CSCI 620 5
The Big Picture: The Performance Perspective
• Performance of a machine is determined by:
– Instruction count
– Clock cycle time
– Clock cycles per instruction
• Processor design (datapath and control) will determine:
– Clock cycle time
– Clock cycles per instruction
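The three factors above combine into the usual CPU time product (instruction count × clock cycles per instruction × clock cycle time). A minimal Python sketch of that arithmetic follows; the function name and the workload numbers are hypothetical and are not taken from the slides.

```python
def cpu_time(instruction_count, cpi, cycle_time_s):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time_s

# Hypothetical workload: 1e9 instructions, CPI of 1.5, 1 ns clock cycle.
base = cpu_time(1e9, 1.5, 1e-9)

# Halving CPI (a datapath/control improvement) halves CPU time,
# provided instruction count and cycle time stay the same.
improved = cpu_time(1e9, 0.75, 1e-9)

print(base, improved)
```

The sketch makes the slide's point concrete: processor design can only move the last two factors, while instruction count is set by the ISA and the compiler.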
CSCI 620 6
A "Typical" RISC ISA
• 32-bit fixed format instruction (3 formats)
• 32 32-bit GPRs (R0 contains zero; double-precision values take a register pair)
• 3-address, reg-reg arithmetic instruction
• Single address mode for load/store: base + displacement
– no indirection
• Simple branch conditions
• Delayed branch
see: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3
CSCI 620 7
Basic MIPS RISC Instruction Set
• All operations on data apply to data in registers
• The only operations that affect memory are loads and stores, which move data from memory to a register or from a register to memory. Therefore, it is called a “load-store” machine
• Instruction formats are few in number, with all instructions typically being one size: simpler decoding is faster
• 32 registers
• 3 classes of instructions: ALU, Load and Store, Branches and jumps
CSCI 620 8
MIPS Instruction Format Overview
CSCI 620 9
I-type Instructions
Examples:
lw R1, 30(R2)        Load word            Regs[R1] ← Mem[30+Regs[R2]]
    opcode = load word (displacement), rs = R2, rt = R1, immediate = 30
swc1 F0, 40(R3)      Store FP single      Mem[40+Regs[R3]] ←₃₂ Regs[F0]₀..₃₁
    opcode = store FP single (displacement), rs = R3, rt = F0, immediate = 40
beq R4, R3, name     Branch on equal      if (Regs[R4] == Regs[R3]) PC ← PC+4+name
    opcode = branch on equal, rs = R4, rt = R3, immediate = name
CSCI 620 10
R-type Instructions
Examples:
add R1, R2, R3       Add                  Regs[R1] ← Regs[R2] + Regs[R3]
    opcode = R-type (register mode), rd = R1, rs = R2, rt = R3, shamt = 0, funct = ADD
slt R1, R2, R3       Set less than        if (Regs[R2] < Regs[R3]) then Regs[R1] ← 1 else Regs[R1] ← 0
    opcode = R-type (register mode), rd = R1, rs = R2, rt = R3, shamt = 0, funct = SLT
sll R1, R2, 10       Shift left logical   Regs[R1] ← Regs[R2] << 10
    opcode = R-type (register mode), rd = R1, rs = 0 (unused), rt = R2, shamt = 10, funct = SLL
CSCI 620 11
J-type Instructions
Examples:
j name               Jump                 PC ← name (jump address)
    opcode = jump, offset = name
jal name             Jump and link        PC ← name, Regs[R31] ← PC+4
    opcode = jump and link, offset = name
There are two more instruction formats for floating point; both are 32-bit fixed formats.
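To make the three formats concrete, here is a small Python sketch that packs fields into 32-bit words. The field widths (6/5/5/5/5/6 for R-type, 6/5/5/16 for I-type, 6/26 for J-type) and the numeric opcode and funct values come from the standard MIPS32 encoding rather than from these slides, and the function names are made up for illustration.

```python
def encode_r(rs, rt, rd, shamt, funct):
    """R-type: opcode(6)=0 | rs(5) | rt(5) | rd(5) | shamt(5) | funct(6)."""
    return (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def encode_i(opcode, rs, rt, imm):
    """I-type: opcode(6) | rs(5) | rt(5) | immediate(16, two's complement)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)

def encode_j(opcode, target):
    """J-type: opcode(6) | target(26)."""
    return (opcode << 26) | (target & 0x3FFFFFF)

# MIPS32 values (assumed, not from the slides): funct 0x20 = add,
# funct 0x00 = sll, opcode 0x23 = lw, opcode 0x02 = j.
add_word = encode_r(rs=2, rt=3, rd=1, shamt=0, funct=0x20)   # add R1, R2, R3
sll_word = encode_r(rs=0, rt=2, rd=1, shamt=10, funct=0x00)  # sll R1, R2, 10
lw_word = encode_i(opcode=0x23, rs=2, rt=1, imm=30)          # lw R1, 30(R2)
j_word = encode_j(opcode=0x02, target=0x100)                 # j to word index 0x100

print(hex(add_word), hex(sll_word), hex(lw_word), hex(j_word))
```

Note how `sll` places its source register in the rt field and leaves rs at zero, while `lw` places the base register in rs and the destination in rt.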
CSCI 620 12
Most Popular MIPS Instructions
Integer benchmarks
CSCI 620 13
Most Popular MIPS Instructions
Floating-point benchmarks
CSCI 620 14
Implementation of MIPS RISC Instruction Set
• Instruction fetch cycle (IF)
– Send PC to memory
– Fetch current instruction from memory
– Update PC (PC ← PC + 4)
• Instruction decode/register fetch cycle (ID)
– Decode instruction
– Read registers corresponding to register source specifiers from register file (in parallel with decoding)
– Look for branch conditions, act accordingly
CSCI 620 15
Implementation of MIPS RISC Instruction Set--continued
• Execution/effective address cycle (EX)
– ALU operates on operands prepared in the prior cycle, then performs one of three things…
– Memory reference: ALU adds base register and offset to form effective address
– Register-register ALU instruction: ALU does operation specified by ALU opcode on values read from register file
– Register-immediate ALU instruction: ALU does operation specified by ALU opcode on the first value read from register file and the sign-extended immediate
CSCI 620 16
Implementation of MIPS RISC Instruction Set--continued
• Memory Access (MEM)
– Performs read using effective address if instruction is a load
– Performs write of data from second register read from register file using effective address if instruction is a store
• Write-back Cycle (WB)
– Write to register file for either register-register ALU instruction or load instruction
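The five cycles described above (IF, ID, EX, MEM, WB) can be sketched as one function that runs a single instruction to completion. This is only an illustration under simplifying assumptions: instructions are hypothetical Python tuples `(op, rd, rs, rt, imm)` rather than encoded words, only `lw`, `sw`, `add`, and `addi` are modeled, and branches are left out.

```python
def step(pc, regs, mem, imem):
    """Run one instruction through IF, ID, EX, MEM, WB; return the next PC."""
    # IF: fetch the instruction and compute PC + 4.
    op, rd, rs, rt, imm = imem[pc]
    next_pc = pc + 4

    # ID: read both source registers from the register file.
    a, b = regs[rs], regs[rt]

    # EX: effective address, register-register, or register-immediate operation.
    if op in ("lw", "sw"):
        alu_out = a + imm          # base register + displacement
    elif op == "add":
        alu_out = a + b
    elif op == "addi":
        alu_out = a + imm
    else:
        raise ValueError(f"unsupported op: {op}")

    # MEM: only loads and stores touch data memory.
    loaded = None
    if op == "lw":
        loaded = mem[alu_out]
    elif op == "sw":
        mem[alu_out] = b           # data comes from the second register read

    # WB: ALU instructions and loads write the register file (R0 stays zero).
    if op in ("add", "addi") and rd != 0:
        regs[rd] = alu_out
    elif op == "lw" and rd != 0:
        regs[rd] = loaded
    return next_pc

regs = [0] * 32
regs[2] = 100
mem = {130: 7}
imem = {
    0: ("lw", 1, 2, 0, 30),     # lw  R1, 30(R2)
    4: ("add", 3, 1, 1, 0),     # add R3, R1, R1
    8: ("sw", 0, 2, 3, 34),     # sw  R3, 34(R2)
}
pc = 0
while pc in imem:
    pc = step(pc, regs, mem, imem)
print(regs[1], regs[3], mem[134], pc)
```

In a pipelined implementation the same five steps run in different hardware stages on different instructions at once, which is what creates the hazards discussed later.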
CSCI 620 17
Pipelining: Basic and Intermediate Concepts
(Appendix A)
CSCI 620 18
Datapath vs Control
• Datapath: storage, functional units (FUs), and interconnect sufficient to perform the desired functions
– Inputs are control points
– Outputs are signals
• Controller: state machine to orchestrate operation on the datapath
– Based on desired function and signals
[Figure: the Controller drives the Datapath through control points; the Datapath returns signals to the Controller]
CSCI 620 19
Implementation of Single cycle machine
From this text
lw $t0, 32($3)
CSCI 620 20
Single cycle machine with Control logic
CSCI 620 21
Division of execution into 5 stages
What factors to consider in the division?
CSCI 620 22
These are pipeline registers. Why do we need them?
CSCI 620 23
Approaching an ISA
• Instruction Set Architecture
– Defines set of operations, instruction format, hardware-supported data types, named storage, addressing modes, sequencing
• Meaning of each instruction is described by RTL (Register Transfer Language) on architected registers and memory
• Given technology constraints, assemble adequate datapath
– Architected storage mapped to actual storage
– Function units to do all the required operations
– Possible additional storage (e.g. MAR, MBR, …)
– Interconnect to move information among regs and FUs
• Map each instruction to sequence of RTLs
• Collate sequences into symbolic controller state transition diagram (STD)
• Lower symbolic STD to control points
• Implement controller
CSCI 620 24
Visualizing Pipelining
[Figure: pipeline diagram; instructions in program order (vertical axis) versus time in clock cycles, Cycle 1 to Cycle 7 (horizontal axis); each instruction passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous instruction]
Pipelining and Hazards
CSCI 620 26
Hazards
Hazards are situations that hamper execution flow
• Structural Hazards:
– Resource conflict: hardware cannot support all possible combinations of instructions simultaneously, e.g. fetching an instruction and fetching data simultaneously from one memory
• Data Hazards:
– Source operands are not available: instruction depends on results of previous instructions still in the pipeline
• Control Hazards:
– Changes in program counter: jumps, calls, interrupts, etc.
– When a branch happens, what happens to the instructions already in the pipeline?
CSCI 620 27
Structural Hazards: One Memory Port (Figure A.4, Page A-14)
[Figure: pipeline diagram; Load, Instr 1, Instr 2, Instr 3, Instr 4 in program order versus time in clock cycles, Cycle 1 to Cycle 7; each instruction passes through Ifetch, Reg, ALU, DMem, Reg; a memory conflict is marked where the Load's DMem access and a later instruction's Ifetch fall in the same cycle]
CSCI 620 28
One Memory Port/Structural Hazards (Similar to Figure A.5, Page A-15)
[Figure: pipeline diagram; Load, Instr 1, Instr 2, Stall, Instr 3 in program order versus time in clock cycles, Cycle 1 to Cycle 7; the stalled slot is drawn as a row of bubbles moving through all five stages, and Instr 3 starts one cycle later]
How do you “bubble” the pipe?
CSCI 620 29
Structural Hazard: Single Memory, another view
Clock cycle number
Instruction   1     2     3     4     5     6     7     8     9     10
Load          IF    ID    EX    MEM   WB
Instr. 1            IF    ID    EX    MEM   WB
Instr. 2                  IF    ID    EX    MEM   WB
Instr. 3                        Stall IF    ID    EX    MEM   WB
Instr. 4                                    IF    ID    EX    MEM   WB
Instr. 5                                          IF    ID    EX    MEM
Instr. 6                                                IF    ID    EX
Structural hazards occur in which cycles? Whenever IF (instruction fetch) and MEM (memory access, read or write) occur together, they are candidates for a structural hazard.
In this example, we assume that only the first instruction (Load) needs to access memory in its MEM cycle.
Structural hazards can be solved by “duplicating the hardware”, e.g. dual-port memory, multiple ALUs.
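The stall in the table above can be reproduced with a short Python sketch. It assumes an in-order five-stage pipeline with a single memory port, where an instruction's MEM stage falls three cycles after its IF stage; the function name and the boolean input format are hypothetical.

```python
def fetch_cycles(uses_mem):
    """Return the IF cycle of each instruction with one shared memory port.

    uses_mem[k] is True when instruction k accesses data memory in MEM.
    A fetch is delayed while an earlier instruction occupies the port in MEM.
    """
    port_busy = set()   # cycles in which a MEM stage holds the memory port
    cycles = []
    next_if = 1
    for mem_access in uses_mem:
        while next_if in port_busy:
            next_if += 1            # stall: the fetch waits one more cycle
        cycles.append(next_if)
        if mem_access:
            port_busy.add(next_if + 3)  # IF, ID, EX, then MEM
        next_if += 1
    return cycles

# The slide's case: only the first instruction (Load) accesses memory in MEM.
slide_case = fetch_cycles([True, False, False, False, False, False, False])
print(slide_case)
```

For the slide's case the fourth instruction (Instr. 3) is fetched in cycle 5 instead of cycle 4, matching the single Stall entry in the table.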
CSCI 620 30
Data Hazard on R1 (Figure A.6, Page A-17)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: pipeline diagram; the five instructions in program order versus time in clock cycles; stages IF, ID/RF, EX, MEM, WB, drawn as Ifetch, Reg, ALU, DMem, Reg]
CSCI 620 31
Classification of Data Hazards
Consider instructions i and j, where i occurs before j. The possible data hazards are:
• RAW (read after write): j tries to read a source before i writes it, so j gets the old value. This is the most common type.
• WAR (write after read): j tries to write a destination before it is read by i, so i incorrectly gets the new value (only possible when some instructions can write results early in the pipeline and other instructions can read sources late in the pipeline; in the MIPS pipeline this hazard cannot happen)
• WAW (write after write): j tries to write an operand before it is written by i (only possible in pipelines that write in more than one pipe stage or allow an instruction to proceed even when a previous instruction is stalled; in the MIPS pipeline this hazard cannot happen)
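The three definitions reduce to simple comparisons of destination and source registers. The Python sketch below classifies the potential hazards between an earlier instruction i and a later instruction j; representing an instruction as a hypothetical `(dest, sources)` pair is an assumption made for illustration, and whether a potential hazard actually causes a problem depends on the pipeline.

```python
def classify(i, j):
    """Return the set of potential data hazards between i (earlier) and j (later).

    Each instruction is a (dest, sources) pair; dest may be None.
    """
    i_dest, i_srcs = i
    j_dest, j_srcs = j
    hazards = set()
    if i_dest is not None and i_dest in j_srcs:
        hazards.add("RAW")   # j reads what i writes
    if j_dest is not None and j_dest in i_srcs:
        hazards.add("WAR")   # j writes what i reads
    if i_dest is not None and i_dest == j_dest:
        hazards.add("WAW")   # both write the same register
    return hazards

raw = classify(("r1", ("r2", "r3")), ("r4", ("r1", "r3")))  # add r1,r2,r3 ; sub r4,r1,r3
war = classify(("r4", ("r1", "r3")), ("r1", ("r2", "r3")))  # sub r4,r1,r3 ; add r1,r2,r3
waw = classify(("r1", ("r4", "r3")), ("r1", ("r2", "r3")))  # sub r1,r4,r3 ; add r1,r2,r3
print(raw, war, waw)
```

The three example pairs are the instruction pairs used on the "Three Generic Data Hazards" slides.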
CSCI 620 32
Three Generic Data Hazards
• Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
I: add r1,r2,r3
J: sub r4,r1,r3
• Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.
CSCI 620 33
Three Generic Data Hazards
• Write After Read (WAR): InstrJ writes an operand before InstrI reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Reads are always in stage 2, and
– Writes are always in stage 5
CSCI 620 34
Three Generic Data Hazards
• Write After Write (WAW): InstrJ writes an operand before InstrI writes it
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
• Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
• Can’t happen in the MIPS 5-stage pipeline because:
– All instructions take 5 stages, and
– Writes are always in stage 5
• Will see WAR and WAW in more complicated pipes
CSCI 620 35
Software Solution to data hazards
The compiler recognizes the data hazard and adds nops to eliminate it. Simple, but …
sub R2, R1, R3 ; register R2 written by sub
nop ; no operation
nop
nop
and R12, R2, R5 ; now, result from sub available
or R13, R6, R2
add R14, R2, R2
sw R15, 100(R2)
Any problem with this solution? Yes: the nops occupy pipeline slots without doing useful work, so cycles are wasted.
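What the compiler does here can be sketched in a few lines of Python. The sketch assumes, as the slide's example does, that a result becomes readable only after three intervening instruction slots; the `(op, dest, sources)` tuple format and the `gap` parameter are hypothetical.

```python
NOP = ("nop", None, ())

def insert_nops(program, gap=3):
    """Pad with nops so a reader is at least `gap` slots after the writer."""
    out = []
    last_write = {}   # register -> position in `out` of its most recent writer
    for op, dest, srcs in program:
        needed = 0
        for reg in srcs:
            if reg in last_write:
                # The reader must sit at position >= writer position + gap + 1.
                needed = max(needed, last_write[reg] + gap + 1 - len(out))
        out.extend([NOP] * needed)
        if dest is not None:
            last_write[dest] = len(out)
        out.append((op, dest, srcs))
    return out

program = [
    ("sub", "R2", ("R1", "R3")),
    ("and", "R12", ("R2", "R5")),
    ("or", "R13", ("R6", "R2")),
    ("add", "R14", ("R2", "R2")),
    ("sw", None, ("R15", "R2")),
]
padded = insert_nops(program)
print([op for op, _, _ in padded])
```

A smarter compiler would try to move independent instructions into those three slots instead of nops, which is the idea behind instruction scheduling.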
CSCI 620 36
Data Hazard Control: Stalls by hardware
• Hazard occurs when instruction reads (in ID stage) register that will be written by an earlier instruction (in WB stage)
• Idea: Detect hazard and stall instructions in pipeline until hazard is resolved
• Detect hazard by comparing read fields in IF/ID pipeline register with write fields in later pipeline registers (ID/EX, EX/MEM, MEM/WB)
• To add a bubble in the pipeline
– Preserve the PC register and the IF/ID pipeline register
– Change the EX, MEM, and WB control fields of the ID/EX pipeline register to do nothing
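The detection rule and the bubble mechanism above can be sketched as two small Python functions. This is a sketch under stated assumptions, not a real control unit: registers are plain integers, register 0 is treated as the hardwired zero and never causes a stall, and the signal names (`pc_write`, `if_id_write`, `id_ex_control`) are invented for illustration.

```python
def must_stall(if_id_reads, id_ex_dest, ex_mem_dest, mem_wb_dest):
    """Compare IF/ID read fields with write fields of later pipeline registers."""
    pending = {id_ex_dest, ex_mem_dest, mem_wb_dest} - {None, 0}
    return any(reg in pending for reg in if_id_reads)

def control_signals(stall):
    """Signals that hold the front of the pipeline and inject a bubble."""
    if stall:
        return {
            "pc_write": False,       # preserve the PC register
            "if_id_write": False,    # preserve the IF/ID pipeline register
            "id_ex_control": {"EX": 0, "MEM": 0, "WB": 0},  # do nothing
        }
    # None stands for "pass the decoded control fields through unchanged".
    return {"pc_write": True, "if_id_write": True, "id_ex_control": None}

# `and R12, R2, R5` is in ID while `sub R2, R1, R3` is one stage ahead.
stalled = must_stall((2, 5), id_ex_dest=2, ex_mem_dest=None, mem_wb_dest=None)
# Later, no instruction ahead of it still has R2 or R5 as a pending write.
clear = must_stall((2, 5), id_ex_dest=None, ex_mem_dest=12, mem_wb_dest=None)
print(stalled, clear, control_signals(stalled))
```

Unlike compiler-inserted nops, these bubbles cost no code space, but they waste the same number of cycles.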
CSCI 620 37
Data Hazard Reduction: Forwarding
• Needed result is available before it is written into register file in WB stage
• Idea: Use temporary results instead of waiting for registers to be written
• Cannot remove the stall when a load is immediately followed by an instruction that reads the loaded register
• Almost all pipelined machines today use some form of forwarding
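A minimal Python sketch of the forwarding idea: for each source operand entering EX, pick the newest in-flight result instead of the register file, and stall only for the load-use case. The names `forward_source` and `load_use_stall`, the string labels, and the integer register numbering are assumptions for illustration; the sketch assumes the classic five-stage pipeline with forwarding paths from EX/MEM and MEM/WB.

```python
def forward_source(src, ex_mem_dest, mem_wb_dest):
    """Choose where an EX-stage operand comes from (register 0 is never forwarded)."""
    if src != 0 and src == ex_mem_dest:
        return "EX/MEM"      # result of the instruction one ahead (newest)
    if src != 0 and src == mem_wb_dest:
        return "MEM/WB"      # result of the instruction two ahead
    return "REGFILE"

def load_use_stall(id_ex_is_load, id_ex_dest, if_id_reads):
    """A load followed immediately by a reader still needs one bubble."""
    return bool(id_ex_is_load and id_ex_dest != 0 and id_ex_dest in if_id_reads)

# add r1,r2,r3 followed by and r6,r1,r7: r1 is forwarded from EX/MEM.
alu_case = forward_source(1, ex_mem_dest=1, mem_wb_dest=None)
# lw r1,0(r2) followed by and r6,r1,r7: the data is not ready in time.
load_case = load_use_stall(True, 1, (1, 7))
print(alu_case, load_case)
```

Checking EX/MEM before MEM/WB matters when two in-flight instructions write the same register: the consumer must see the most recent value.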
CSCI 620 38
Data Hazard on r1
sub r4, r1, r3
add r1, r2, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
[Figure: pipeline diagram; the five instructions in program order versus time in clock cycles, CC 1 to CC 6; stages IM, Reg, ALU, DM, Reg; two points are labeled “r1 is changed here”]
Are both hazards?
CSCI 620 39
Forwarding to Avoid Data Hazard
sub r4, r1, r3
add r1, r2, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
[Figure: pipeline diagram with forwarding paths; the five instructions in program order versus time in clock cycles, CC 1 to CC 6; stages IM, Reg, ALU, DM, Reg]
Reading a register in the same cycle in which it is written can be done by writing on the rising edge of the clock and reading on the falling edge.
CSCI 620 40
Data Hazard Even with Forwarding
sub r4, r1, r5
lw r1, 0(r2)
and r6, r1, r7
or r8, r1, r9
[Figure: pipeline diagram; the four instructions in program order versus time in clock cycles, CC 1 to CC 5; the value loaded by lw would have to be forwarded backward in time to the and instruction]
This can’t be done because it means forwarding the result in “negative time”. So, we have to stall the pipeline.
See next page
CSCI 620 41
Data Hazard Even with Forwarding
sub r4, r1, r5
lw r1, 0(r2)
and r6, r1, r7
or r8, r1, r9
[Figure: pipeline diagram; the four instructions in program order versus time in clock cycles, CC 1 to CC 6; bubbles are inserted after lw so that the and and or instructions start one cycle later]
CSCI 620 42
Control Hazard on Branches: Three Stage Stall
40 beqz R1, 36
44 and R12, R2, R5
48 or R13, R6, R2
52 add R14, R2, R2
80 ld R4, R7, 100
[Figure: pipeline diagram; program execution order (in instructions) versus time in clock cycles, CC 1 to CC 9; stages shown include IM, Reg, and DM]
CSCI 620 43
Control Hazard on Branches: Three Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline diagram; the five instructions in program order versus time in clock cycles; each passes through Ifetch, Reg, ALU, DMem, Reg]
What do you do with the 3 instructions in between?
How do you do it?
Where is the “commit”?
CSCI 620 44
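Pipelined MIPS Datapath with branch logic in 2nd stage (Figure A.24, page A-38)
[Figure: datapath divided into Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr. Calc, Memory Access, and Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; components include the instruction memory (Address input), an adder producing Next SEQ PC (PC + 4), a second adder and a Zero? test for the branch, the register file (RS1, RS2, RD), Sign Extend for the immediate, multiplexers, the ALU, data memory, and the WB Data path back to the register file]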
CSCI 620 45
Branch Characteristics
• Integer Benchmarks: 14 – 16% of instructions are conditional branches
• Floating Point: 3 – 12%
• On average:
– 67% of conditional branches are “taken”
– 60% of forward branches are taken
– 85% of backward branches are taken
CSCI 620 46
Solutions to Control Hazards
1. Stall the pipeline until the branch is decided
– Simple, nothing else to do, no problems with exceptions
2. Assume (predict) not taken
– Execute successor instructions in sequence
– “Squash” instructions in the pipeline if the branch is actually taken
– Requires back-out logic in case the branch is taken
– PC+4 is already calculated, so use it to get the next instruction
– As we have seen above, most branches are taken, so wrong predictions may slow the pipeline down
3. Assume taken
– No advantage for a simple pipeline like MIPS, as the branch address is not known until the MEM stage
Solutions 2 & 3 are simple predictions with a heavy penalty for a wrong prediction, which motivates “dynamic branch prediction” (next slide)
4. Delayed branch, a software solution
– Can help a little
CSCI 620 47
A Dynamic Branch Prediction
• Uses some kind of “branch history table”
• Tries to see the trend in branch behavior (taken or not taken) of the current code
The states in a 2-bit prediction scheme
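Since the state diagram did not survive extraction, here is a hedged Python sketch of one common 2-bit scheme, the saturating counter; the diagram on the original slide may use slightly different transitions. The class name, the table size of 16 entries, and the indexing by word-aligned PC are assumptions chosen for illustration.

```python
class TwoBitPredictor:
    """Branch history table of 2-bit saturating counters.

    Counter values 0 and 1 predict not taken; 2 and 3 predict taken.
    """

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [0] * entries   # start in "strongly not taken"

    def _index(self, pc):
        return (pc >> 2) % self.entries   # drop the byte offset, then wrap

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)

# A loop branch at a hypothetical address: taken 9 times, then falls through,
# and the whole loop runs twice.
outcomes = ([True] * 9 + [False]) * 2
predictor = TwoBitPredictor()
mispredictions = 0
for taken in outcomes:
    if predictor.predict(0x40) != taken:
        mispredictions += 1
    predictor.update(0x40, taken)
print(mispredictions, "of", len(outcomes))
```

The point of the second bit shows up on the second pass through the loop: the single not-taken exit only weakens the counter, so the predictor still predicts taken when the loop is re-entered.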
CSCI 620 48
Delayed Branch by compiler
(a) is the best choice if possible
Strategies (b) & (c) are used when (a) is not possible
CSCI 620 49
Delayed Branch
• Where to get instructions to fill the branch delay slot?
– From before the branch instruction
– From the target address: only valuable when branch taken
– From fall through: only valuable when branch not taken
• Compilers’ effectiveness for a single branch delay slot (as in MIPS):
– Fills about 60% of branch delay slots
– About 80% of instructions executed in branch delay slots are useful in computation
– About 50% (60% x 80%) of slots are usefully filled
• Delayed branch downside: as processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
– Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
– Growth in available transistors has made dynamic approaches relatively cheaper
CSCI 620 50
Delayed Branch
• The following is quoted from
http://www.cs.umd.edu/class/fall2001/cmsc411/projects/branches/delay.html
• So, if this strategy offers improvements irregardless of whether we take or do not take the branch, what is the problem? The problem is trying to find an instruction that can both be safely executed whether the branch is taken or not, and will still improve performance. This is the compiler's job, and so using a branch delay slot makes compilers more complex to program. Also, Hennesy and Patterson mention that using this option does cause one shortcoming, if the hardware is changed so that a delay-branch slot is no longer used, all the old programs will no longer work. C programs would have to be recompiled, and assembly language programs and routines would have to be re-written. So, this method does put a lot more work into the hands of the system programmers.
CSCI 620 51
Speedup of pipeline with branches
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stalls}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
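A quick numeric check of the formula in Python. The sketch assumes an ideal CPI of 1 and that branch stalls are the only stalls; the 15% branch frequency sits inside the 14 to 16% range quoted earlier for integer benchmarks, and the penalties of 3 and 1 cycles are illustrative choices (a three-stage stall versus branch logic moved to the 2nd stage).

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Speedup over an unpipelined machine when only branches cause stalls."""
    return depth / (1 + branch_frequency * branch_penalty)

three_cycle = pipeline_speedup(depth=5, branch_frequency=0.15, branch_penalty=3)
one_cycle = pipeline_speedup(depth=5, branch_frequency=0.15, branch_penalty=1)
ideal = pipeline_speedup(depth=5, branch_frequency=0.15, branch_penalty=0)

print(round(three_cycle, 2), round(one_cycle, 2), ideal)
```

With these numbers a three-cycle penalty cuts the ideal speedup of 5 to about 3.45, while a one-cycle penalty keeps it at about 4.35, which is why resolving branches earlier in the pipeline pays off.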
CSCI 620 52
Pipeline Hazards
Hazard | Caused by | Solved by
Structural Hazard | Simultaneous need for the same hardware components by different pipeline stages (e.g. ALU, memory are needed by several stages) | Duplicated hardware (e.g. multiple ALUs, separate instruction memory and data memory)
Data Hazard | Dependencies of data between instructions | Stalling (= flushing); forwarding (= bypassing)
Control Hazard | Branch instructions | Stalling (= flushing); branch prediction, static (taken/not taken) or dynamic; delayed branch (software)