week 11 introduction to pipelined...
TRANSCRIPT
Spring 2005331 W11.1
14:332:331Computer Architecture and Assembly Language
Spring 2005
Week 11Introduction to Pipelined Datapath
[Adapted from Dave Patterson’s UCB CS152 slides and
Mary Jane Irwin’s PSU CSE331 slides]
Spring 2005331 W11.2
Head’s UpReminders
Pipelined datapath and controlHW#6 will be handed out soon
Spring 2005331 W11.3
Review: Multicycle Data and Control Path
Address
Read Data(Instr. or Data)
Memory
PC
Write Data
Read Addr 1
Read Addr 2
Write Addr
Register
File
ReadData 1
ReadData 2
ALU
Write Data
IRM
DR
AB
ALU
out
SignExtend
Shiftleft 2 ALU
control
Shiftleft 2
ALUOpControl
FSM
IRWriteMemtoReg
MemWriteMemRead
IorDPCWrite
PCWriteCond
RegDstRegWrite
ALUSrcAALUSrcB
zero
PCSource
1
1
1
1
1
10
0
0
0
0
0
2
2
3
4
Instr[5-0]
Instr[25-0]
PC[31-28]
Instr[15-0]
Instr[31-26]
32
28
Spring 2005331 W11.4
Review: RTL SummaryStep R-type Mem Ref Branch Jump
Instrfetch
IR = Memory[PC]; PC = PC + 4;
Decode A = Reg[IR[25-21]];B = Reg[IR[20-16]];
ALUOut = PC +(sign-extend(IR[15-0])<< 2);
Execute ALUOut = A op B;
ALUOut = A + sign-extend (IR[15-0]);
if (A==B) PC =
ALUOut;
PC = PC[31-28] ||(IR[25-0]
<< 2);
Memory access
Reg[IR[15-11]] = ALUOut;
MDR = Memory[ALUOut];
orMemory[ALUOut]
= B;
Write-back
Reg[IR[20-16]] = MDR;
Spring 2005331 W11.5
Review: Multicycle Datapath FSM
Start
Instr Fetch Decode
Write Back
Memory Access
Execute
(Op = R-
type)
(Op =
beq)
(Op = lw or sw) (Op = j)
(Op = lw)(Op = sw)
0 1
2
3
4
5
6
7
8 9
Unless otherwise assigned
PCWrite,IRWrite,MemWrite,RegWrite=0others=X
IorD=0MemRead;IRWrite
ALUSrcA=0ALUsrcB=01
PCSource,ALUOp=00PCWrite
ALUSrcA=0ALUSrcB=11ALUOp=00
PCWriteCond=0
ALUSrcA=1ALUSrcB=10ALUOp=00
PCWriteCond=0
ALUSrcA=1ALUSrcB=00ALUOp=10
PCWriteCond=0
ALUSrcA=1ALUSrcB=00ALUOp=01
PCSource=01PCWriteCond
PCSource=10PCWrite
MemReadIorD=1
PCWriteCond=0
MemWriteIorD=1
PCWriteCond=0
RegDst=1RegWriteMemtoReg=0
PCWriteCond=0
RegDst=0RegWriteMemtoReg=1
PCWriteCond=0
Spring 2005331 W11.6
Review: FSM Implementation
Combinationalcontrol logic
State RegNextState
Inst[31-26]
Inputs
Out
puts
Op0
Op1
Op2
Op3
Op4
Op5
PCWritePCWriteCondIorDMemReadMemWriteIRWriteMemtoRegPCSourceALUOpALUSourceBALUSourceARegWriteRegDst
System Clock
Spring 2005331 W11.7
Single Cycle Disadvantages & AdvantagesUses the clock cycle inefficiently – the clock cycle must be timed to accommodate the slowestinstruction
Is wasteful of area since some functional units must (e.g., adders) be duplicated since they can not be shared during a clock cycle
but
Is simple and easy to understand
Clk
Single Cycle Implementation:
lw sw Waste
Cycle 1 Cycle 2
Spring 2005331 W11.8
Multicycle Advantages & DisadvantagesUses the clock cycle efficiently – the clock cycle is timed to accommodate the slowest instruction step
balance the amount of work to be done in each steprestrict each step to use only one major functional unit
Multicycle implementations allowfunctional units to be used more than once per instruction as long as they are used on different clock cyclesfaster clock ratesdifferent instructions to take a different number of clock cycles
but
Requires additional internal state registers, muxes, and more complicated (FSM) control
Spring 2005331 W11.9
The Five Stages of Load InstructionCycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
IFetch Dec Exec Mem WBlw
IFetch: Instruction Fetch and Update PC
Dec: Registers Fetch and Instruction Decode
Exec: Execute R-type; calculate memory address
Mem: Read/write the data from/to the Data Memory
WB: Write the data back to the register file
Spring 2005331 W11.10
Single Cycle vs. Multiple Cycle TimingSingle Cycle Implementation:
Clk Cycle 1
Multiple Cycle Implementation:
IFetch Dec Exec Mem WB
Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10
IFetch Dec Exec Memlw sw
Clk
lw sw Waste
IFetchR-type
Cycle 1 Cycle 2
multicycle clock slower than 1/5th of single cycle clock due to stage flipflop overhead
Spring 2005331 W11.11
Pipelined MIPS ProcessorStart the next instruction while still working on the current one
improves throughput - total amount of work done in a given time
instruction latency (execution time, delay time, response time) is not reduced - time from the start of an instruction to its completion
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
IFetch Dec Exec Mem WBlw
Cycle 7Cycle 6 Cycle 8
sw IFetch Dec Exec Mem WB
R-type IFetch Dec Exec Mem WB
Spring 2005331 W11.12
Single Cycle, Multiple Cycle, vs. Pipeline
Clk
Cycle 1
Multiple Cycle Implementation:
IFetch Dec Exec Mem WB
Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9Cycle 10
lw IFetch Dec Exec Mem WB
IFetch Dec Exec Memlw sw
Pipeline Implementation:
IFetch Dec Exec Mem WBsw
Clk
Single Cycle Implementation:
Load Store Waste
IFetchR-type
Cycle 1 Cycle 2
wasted cycle
IFetch Dec Exec Mem WBR-type
Spring 2005331 W11.13
Pipelining the MIPS ISAWhat makes it easy
all instructions are the same length (32 bits)few instruction formats (three) with symmetry across formatsmemory operations can occur only in loads and storesoperands must be aligned in memory so a single data transfer requires only one memory access
What makes it hardstructural hazards: what if we had only one memorycontrol hazards: what about branchesdata hazards: what if an instruction’s input operands depend on the output of a previous instruction
Spring 2005331 W11.14
MIPS Pipeline Datapath ModificationsWhat do we need to add/modify in our MIPS datapath?
State registers between pipeline stages to isolate them
ReadAddress
InstructionMemory
Add
PC
4
0
1
Write Data
Read Addr 1
Read Addr 2
Write Addr
Register
File
ReadData 1
ReadData 2
16 32
ALU
1
0
Shiftleft 2
Add
DataMemoryAddress
Write Data
ReadData
1
0
IFet
ch/D
ec
Dec
/Exe
c
Exec
/Mem
Mem
/WB
IFetch Dec Exec Mem WB
System Clock
SignExtend
Spring 2005331 W11.15
MIPS Pipeline Control Path ModificationsAll control signals are determined during Decode
and held in the state registers between pipeline stages
ReadAddress
InstructionMemory
Add
PC
4
0
1
Write Data
Read Addr 1
Read Addr 2
Write Addr
Register
File
ReadData 1
ReadData 2
16 32
ALU
1
0
Shiftleft 2
Add
DataMemoryAddress
Write Data
ReadData
1
0
WB
IFet
ch/D
ec
Dec
/Exe
c
Exec
/Mem
Mem
/WB
IFetch Dec Exec Mem
System Clock
Control
SignExtend
Spring 2005331 W11.16
Graphically Representing MIPS Pipeline
ALUIM Reg DM Reg
Can help with answering questions like:how many cycles does it take to execute this code?what is the ALU doing during cycle 4?is there a hazard, why does it occur, and how can it be fixed?
Spring 2005331 W11.17
Why Pipeline? For Throughput!Time (clock cycles)
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
Once the pipeline is full, one instruction is completed every cycle
Time to fill the pipeline
Inst 0
Inst 1
Inst 2
Inst 4
Inst 3
Instr.
Order
Spring 2005331 W11.18
Can pipelining get us into trouble?Yes: Pipeline Hazards
structural hazards: attempt to use the same resource by two different instructions at the same timedata hazards: attempt to use item before it is ready
- instruction depends on result of prior instruction still in the pipeline
control hazards: attempt to make a decision before condition is evaulated
- branch instructions
Can always resolve hazards by waitingpipeline control must detect the hazardtake action (or delay action) to resolve hazards
Spring 2005331 W11.19
A Unified Memory Would Be a Structural HazardTime (clock cycles)
ALUMem Reg Mem Reg
ALUMem Reg Mem Reg
ALUMem Reg Mem Reg
ALUMem Reg Mem Reg
ALUMem Reg Mem Reg
Reading data from memory
Reading instruction from memory
lw
Inst 1
Inst 2
Inst 4
Inst 3
Instr.
Order
Spring 2005331 W11.20
How About Register File Access?Time (clock cycles)
Instr.
Order
add
Inst 1
Inst 2
Inst 4
add
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
Can fix register file access hazard by doing reads in the second half of the cycle and writes in the first half.
Spring 2005331 W11.21
Branch Instructions Cause Control HazardsDependencies backward in time cause hazards
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
add
beq
lw
Inst 4
Inst 3
Instr.
Order
Spring 2005331 W11.22
One Way to “Fix” a Control Hazard
Instr.
Order
add
beq
ALUIM Reg DM Reg
ALUIM Reg DM Reg
Inst 3
lw
ALUIM Reg DM Reg
ALUIM Reg DM Reg
Can fix branch hazard by waiting –stall – but affects throughput
stall
stall
Spring 2005331 W11.23
Register Usage Can Cause Data HazardsDependencies backward in time cause hazards
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
add r1,r2,r3
sub r4,r1,r5
and r6,r1,r7
xor r4,r1,r5
or r8, r1, r9
Instr.
Order
Spring 2005331 W11.24
One Way to “Fix” a Data Hazard
Instr.
Order
add r1,r2,r3
ALUIM Reg DM Reg
sub r4,r1,r5
and r6,r1,r7
ALUIM Reg DM Reg
ALUIM Reg DM Reg
stall
stall
Can fix data hazard by waiting – stall –but affects throughput
Spring 2005331 W11.25
Loads Can Cause Data HazardsDependencies backward in time cause hazards
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
lw r1,100(r2)
sub r4,r1,r5
and r6,r1,r7
xor r4,r1,r5
or r8, r1, r9
Instr.
Order
Spring 2005331 W11.26
Stores Can Cause Data HazardsDependencies backward in time cause hazards
add r1,r2,r3
sw r1,100(r5)
and r6,r1,r7
xor r4,r1,r5
or r8, r1, r9A
LUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
ALUIM Reg DM Reg
Instr.
Order
Spring 2005331 W11.27
Other Pipeline Structures Are PossibleWhat about (slow) multiply operation?
let it take two cycles
What if the data memory access is twice as slow as the instruction memory?
make the clock twice as slow or …let data memory access take two cycles (and keep the same clock rate)
ALUIM Reg DM Reg
MUL
ALUIM Reg DM1 RegDM2
Spring 2005331 W11.28
Sample Pipeline AlternativesARM7
StrongARM-1
XScale
IM Reg EX
PC updateIM access
decodereg
access
ALU opDM accessshift/rotatecommit result
(write back)
ALUIM Reg DM Reg
ALUIM1 IM2 DM1 Reg
DM2Reg SHFT
PC updateBTB access
start IM access
IM access
decodereg 1 access
shift/rotatereg 2 access
ALU op
start DM accessexception
DM writereg write
Spring 2005331 W11.29
SummaryAll modern day processors use pipelining
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Multiple tasks operating simultaneously using different resources
Potential speedup = Number of pipe stages
Pipeline rate limited by slowest pipeline stageUnbalanced lengths of pipe stages reduces speedupTime to “fill” pipeline and time to “drain” it reduces speedup
Must detect and resolve hazardsStalling negatively affects throughput