lecture 08: risc-v pipeline implementation€¦ · lecture 08: risc-v pipeline implementation csce...

56
Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong Yan [email protected] https://passlab.github.io/CSCE513 1

Upload: others

Post on 30-Jul-2020

36 views

Category:

Documents


3 download

TRANSCRIPT

Page 1: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Lecture08:RISC-VPipelineImplementation

CSCE513ComputerArchitecture

DepartmentofComputerScienceandEngineeringYonghong Yan

[email protected]://passlab.github.io/CSCE513

1

Page 2: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Acknowledgement

• SlidesadaptedfromComputerScience152:ComputerArchitectureandEngineering,Spring2016byDr.GeorgeMichelogiannakis fromUCBerkeley

2

Page 3: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Review

• CPUperformancefactors– Instructioncount• DeterminedbyISAandcompiler

– CPIandCycletime• DeterminedbyCPUhardware

• Threegroupsofinstructions– Memoryreference:lw,sw– Arithmetic/logical:add,sub,and,or,slt– Controltransfer:jal,jalr,b*

• CPI– Single-cycle,CPI=1,andnormallylongercycle– 5stageunpipelined,CPI=5– 5stagepipelined,CPI=1

CPU Time = InstructionsProgram

* CyclesInstruction

*TimeCycle

3

Page 4: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Review:Unpipelined Datapath forRISC-V

4

0x4

RegWriteEn

AddAdd

clk

WBSelMemWrite

addr

wdata

rdataDataMemory

we

WASel Op2SelImmSelOpCode

clk

clk

addrinst

Inst.Memory

PC rd1

GPRs

rs1rs2

wawd rd2

we

ImmSelect

ALU

ALUControl

PCSelbrrindjabspc+4

Bcomp?BrLogic

Page 5: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Review:HardwiredControlTable

5

Opcode ImmSel Op2Sel FuncSel MemWr RFWen WBSel WASel PCSel

ALUALUiLWSWBEQtrueBEQfalseJJALJALR

Op2Sel=Reg /Imm WBSel =ALU/Mem /PCWASel =rd /X1 PCSel =pc+4/br /rind/jabs

* * * no yes rindPC rdjabs* * * no yes PC X1

jabs* * * no no * *pc+4SBType12 * * no no * *brSBType12 * * no no * *pc+4SType12 Imm + yes no * *

pc+4* Reg Func no yes ALU rdIType12 Imm Op pc+4no yes ALU rd

pc+4IType12 Imm + no yes Mem rd

Page 6: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

AnIdealPipeline

• Allobjectsgothroughthesamestages• Nosharingofresourcesbetweenanytwostages• Propagationdelaythroughallpipelinestagesisequal• Theschedulingofanobjectenteringthepipelineisnotaffectedbytheobjectsinotherstages

6

stage1

stage2

stage3

stage4

TheseconditionsgenerallyholdforindustrialassemblylinesForlaundrypipeline,twoloadsdonotdependoneachother.

Butinstructionsdependoneachother!

Page 7: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

TechnologyAssumptions

7

Thus,thefollowingtimingassumptionisreasonable

• Asmallamountofveryfastmemory(caches)backedupbyalarge,slowermemory

• FastALU(atleastforintegers)• Multiported Registerfiles(slower)

tIF ~=tID/RF ~=tEX ~=tMEM ~=tWB

A5-stagepipelinewillbe focusofourdetaileddesignSomecommercialdesignshaveover30pipelinestagestodoanintegeradd!

Page 8: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

5-StagePipelinedExecution:ResourceUsageTheWholePipelineResourcesareUsedby5InstructionsinEveryCycle!

8

time t0 t1 t2 t3 t4 t5 t6 t7 ....Instruction1 IF1 ID1 EX1 ME1 WB1Instruction2 IF2 ID2 EX2 ME2 WB2Instruction3 IF3 ID3 EX3 ME3 WB3Instruction4 IF4 ID4 EX4 ME4 WB4Instruction5 IF5 ID5 EX5 ME5 WB5

Write-Back

(WB)- I1I-Fetch(IF)- I5

Execute(EX)- I3

Decode,Reg.Fetch(ID)- I4

Memory(ME)- I2

addr

wdata

rdataDataMemory

weALU

ImmSelect

0x4Add

addrrdata

Inst.Memory

rd1

GPRs

rs1rs2wawdrd2

we

IRPC

Page 9: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

InstructionRegisterinEachStage

9

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

• AninstructionReg (IR) ineachstagetocontaintheinstructioninthatstage

Instruction1

Page 10: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ConnectControlsfromInstructionRegister

10

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

DataMemorywdata

addr

wdata

rdata

we

ImmSel Op2Sel

WBSelMemWrite

RegWriteEn

F D E M W

ALUControl

Instruction1

Instruction4Instruction3

Instruction2

Page 11: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ComparedWithControlLogicinUnpipelined

• Unpipelined:– Singlecontrollogicuses

theinstructionfromIF• Pipelined:– Distributedlogicsthat

uses instructionsfromIRsineachstage

11

Page 12: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

InstructionsInteractWithEachOtherin Pipeline:DealingwithHazards

• Aninstructionmayneedaresourcebeingusedbyanotherinstructionà structuralhazard– Solution#1:Stallingnewerinstructiontillolderinstructionfinishes– Solution#2:Addingmorehardwaretodesign• E.g.,separatememoryintoI-memoryandD-memory

– Our5-stagepipelinehasnostructuralhazardsbydesign

• Aninstructiondependsonsomethingproducedbyanearlierone– Dependencemaybeforadatavalueorforusingsameregister(notthe

value)àdatahazard• SolutionsforRAWhazards:#1,interlocking(bubbledelay),and#2,forwarding

• WARandWRWhazards:notpossiblefor5-stagepipeline

– Dependencemaybeforthenextinstruction’saddressà controlhazard(branches,exceptions)• Solutions:#delay,prediction,etc

12

Page 13: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Read-After-Write(RAW)DataHazards

13

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

x1 in GPRs contains stale value since the passing of value between two instructions has to go through GPRs (register file).

x1 ¬ …x4 ¬ x1 …

IRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataData Memory

we

Page 14: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ToResolveDataHazards:#1,Interlocking,i.e.StallPipelinebyInsertingBubbles

14

stalled stages

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I3 I3 I3 I4 I5ID I1 I2 I2 I2 I2 I3 I4 I5EX I1 - - - I2 I3 I4 I5ME I1 - - - I2 I3 I4 I5WB I1 - - - I2 I3 I4 I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) x1 ¬ (x0) + 10IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ (x1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3 WB3(I4) IF4 ID4 EX4 ME4 WB4(I5) IF5 ID5 EX5 ME5 WB5

Resource Usage

- Þ pipeline bubble

Page 15: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

InsertBubbleforInterlocking

15

IRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataData Memory

we

bubble

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

Stall Condition Bubble:Nop instruction,e.g.addx0,x0,x0BubbleareinsertedatIDandbyenteringEXEstage

Page 16: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

InterlockControlLogic

16

IRIR IR

PCA

B

Y

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataData Memory

we

bubble

Compare the source registers of the instruction in the decode stage with the destination register of the uncommittedinstructions.

stallCstall

ws

rs2rs1 ?

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

Page 17: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

InterlockControlLogicignoringjumps&branches

17

Shouldwealwaysstallif anrs fieldmatchessomerd?

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stallCstall

wsWB

rs1rs2 ?

weWB

re1 re2Cre

wsEXweMEM wsMEM

Cdest CdestweEX

noteveryinstructionwritesaregister=>wenoteveryinstructionreadsaregister=>re

ImmSelect

we:writeenable,1-biton/offws:writeselect,5-bitregisternumberre:readenable,1-biton/offrs:readselect,5-bitregisternumber

...x1 ¬ x0 + 10x4 ¬ x1 + 17...

Page 18: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Source&DestinationRegisters

18

ALUI/LW/JALRALU

SW/Bcond

func7 rs2 rs1 func3 rd opcode

immediate12 rs1 func3 rd opcode

imm rs2 rs1 func3 imm

Jump Offset[19:0]

opcode

rd opcode

source(s) destinationALU rd <=rs1func10rs2 rs1,rs2 rdALUI rd <=rs1opimm rs1 rdLW rd <=M[rs1+imm] rs1 rdSW M[rs1+imm]<=rs2 rs1,rs2 -Bcond rs1,rs2 rs1,rs2 -

true: PC<=PC+immfalse: PC<=PC+4

JAL x1<=PC,PC<=PC+imm - rdJALR rd <=PC,PC<=rs1+imm rs1 rd

Page 19: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

DerivingtheStallSignal

19

Cdestws =rdwe=Caseopcode

ALU,ALUi,LW,JALR=>on... =>off

Crere1=Caseopcode

ALU,ALUi,LW,SW,Bcond,JALR=>onJAL=>off

re2=CaseopcodeALU,SW,Bcond =>on… =>off

Cstall stall=((rs1D==wsEX)&&weEX +(rs1D==wsMEM)&&weMEM +(rs1D==wsWB)&&weWB)&&re1D +((rs2D==wsEX)&&weEX +(rs2D==wsMEM)&&weMEM +(rs2D==wsWB)&&weWB)&&re2D

source(s) destinationALU rd <=rs1func10rs2 rs1,rs2 rdALUI rd <=rs1opimm rs1 rdLW rd <=M[rs1+imm] rs1 rdSW M[rs1+imm]<=rs2 rs1,rs2 -Bcond rs1,rs2 rs1,rs2 -

true: PC<=PC+immfalse: PC<=PC+4

JAL x1<=PC,PC<=PC+imm - rdJALR rd <=PC,PC<=rs1+imm rs1 rd

NoneedtheWBforinterlockcontrolsinceweonlyneedtodealwithhazardbetweenMEM-EXEandEXE-EXE.FortwoinstructionswhichareinWBandEXE,andhaveRAWhazard,thedependencyarehandledthroughtheregisterfile.

Page 20: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ToResolveDataHazards:#2,Forwarding(Bypassing)

20

Eachstallorkillintroducesabubbleinthepipeline=>CPI>1

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .(I1) x1 ¬ x0 + 10 IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ x1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 ME2 WB2(I3) IF3 IF3 IF3 IF3 ID3 EX3 ME3(I4) stalled stages IF4 ID4 EX4(I5) IF5 ID5

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .(I1) x1 ¬ x0 + 10 IF1 ID1 EX1 ME1 WB1(I2) x4 ¬ x1 + 17 IF2 ID2 EX2 ME2 WB2(I3) IF3 ID3 EX3 ME3 WB3(I4) IF4 ID4 EX4 ME4 WB4(I5) IF5 ID5 EX5 ME5 WB5

Anewdatapath,i.e.,abypass,cangetthedatafromtheoutputoftheALUtoitsinput

Page 21: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Review:HardwareSupportforForwarding,andDetectingRAWHazardswithPreviousand2nd PreviousInstructions

• Slide48oflecture05-06

21

Page 22: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

AddingaBypass(ToBypassRegisterFiles)

22

ASrc

...(I1) x1<=x0+10(I2) x4<=x1+17

x4<=x1... x1<=...

IRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR

ImmSelect

ALUrd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stall

D

E M W

Whendoesthisbypasshelp?x1<=M[x0+10]x4<=x1+17

JAL500x4<=x1+17

Yes No,LoadàEXE-Use No

Page 23: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

TheBypassSignal:DerivingitfromtheStallSignal

23

ASrc =(rs1D==wsE)&&weE &&re1D

we=CaseopcodeALU,ALUi,LW,,JALJALR=>on...=>off

NobecauseonlyALUandALUi instructionscanbenefitfromthisbypass

Isthiscorrect?

SplitweE intotwocomponents:we-bypass,we-stall

stall=(((rs1D==wsE)&&weE +(rs1D==wsM)&&weM +(rs1D==wsW)&&weW)&&re1D+((rs2D==wsE)&&weE +(rs2D==wsM)&&weM +(rs2D==wsW)&&weW)&&re2D)

ws =rd

Page 24: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BypassandStallSignals

24

we-bypassE =CaseopcodeEALU,ALUi =>on... =>off

ASrc =(rs1D==wsE)&&we-bypassE &&re1D

SplitweE intotwocomponents:we-bypass,we-stall

stall= ((rs1D==wsE)&&we-stallE +(rs1D==wsM)&&weM +(rs1D==wsW)&&weW) &&re1D

+((rs2D==wsE)&&weE +(rs2D==wsM)&&weM +(rs2D==wsW)&&weW) &&re2D

we-stallE =CaseopcodeELW,JAL,JALR=>onJAL =>on... =>off

Page 25: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

FullyBypassedDatapath

25

ASrcIRIR IR

PC A

BY

R

MD1 MD2

addrinst

InstMemory

0x4Add

IR ALU

ImmSelect

rd1

GPRs

rs1rs2

wawd rd2

we

wdata

addr

wdata

rdataDataMemory

we

bubble

stall

D

E M W

PCforJAL,...

BSrc

Istherestillaneedforthestallsignal? stall=(rs1D==wsE) &&(opcodeE==LWE)&&(wsE!=0 )&&re1D

+(rs2D==wsE)&&(opcodeE==LWE)&&(wsE!=0 )&&re2D

Page 26: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ControlHazards:BranchesandJumps

• JAL:unconditionaljumptoPC+immediate

• JALR:indirectjumptors1+immediate

• Branch:if(rs1conds rs2),branchtoPC+immediate

26

Page 27: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

InfoforControlTransfer

27

Instruction Takenknown? Targetknown?

JALJALRB<cond.>

Twopiecesofinfo:1.TakenorNotTaken2.Targetaddress?

• JAL:unconditionaljumptoPC+immediate• JALR:indirectjumptors1+immediate• Branch:if(rs1conds rs2),branchtoPC+immediate

AfterInst.Decode

AfterInst.Decode AfterInst.Decode

AfterInst.Decode AfterReg.Fetch

After Execute

Page 28: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

SpeculateNextAddressisPC+4

28

A jump instruction kills (not stalls)the following instruction

stall

How?

I2

I1

104

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

Jump?

PCSrc (pc+4 / jabs / rind/ br)

I1 096 ADD I2 100 J 304I3 104 ADD

I4 304 ADD

kill

Page 29: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

PipeliningJumps

29

I2

I1

104

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

Jump?

PCSrc (pc+4 / jabs / rind/ br)

IRSrcD = Case opcodeDJAL Þ bubble... Þ IM

To kill a fetched instruction -- Insert a mux before IR

Any interaction between stall and jump?

bubble

IRSrcD

I2 I1

304bubble

I1 096 ADD I2 100 J 304I3 104 ADD

I4 304 ADD

kill

Page 30: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

JumpPipelineDiagrams

30

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5ID I1 I2 - I4 I5EX I1 I2 - I4 I5ME I1 I2 - I4 I5WB I1 I2 - I4 I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: J 304 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 - - - -(I4) 304: ADD IF4 ID4 EX4 ME4 WB4

Resource Usage

- Þ pipeline bubble

Page 31: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

PipeliningConditionalBranches

31

I1 096 ADDI2 100 BEQx1,x2+200I3 104 ADDI4 304 ADD

BEQ?

I2

I1

104

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

PCSrc (pc+4/jabs/rind/br)

bubble

IRSrcD

Branchconditionisnotknownuntiltheexecutestage

AYALU

Taken?

Page 32: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

PipeliningConditionalBranches

32

I1 096 ADDI2 100 BEQx1,x2+200I3 104 ADDI4 304 ADD

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E MAdd

PCSrc (pc+4/jabs/rind/br)

bubble

IRSrcD

AYALU

Taken?

Ifthebranchistaken:- Killthetwofollowinginstructions- TheinstructionatthedecodestageisnotvalidÞ stallsignalisnotvalid

I2 I1

108I3

Bcond?

?

Page 33: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

PipeliningConditionalBranches

33

I1: 096 ADDI2: 100 BEQx1,x2+200I3: 104 ADDI4: 304 ADD

stall

IR IR

PC addrinst

InstMemory

0x4Add

bubble

IR

E M

PCSrc (pc+4/jabs/rind/br)

bubble AYALU

Taken?I2 I1

108I3

Bcond?

Jump?

IRSrcD

IRSrcE

Ifthebranchistaken- killthetwofollowinginstructions- theinstructionatthedecodestageisnotvalidÞ stallsignalisnotvalid

Add

PC

Page 34: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BranchPipelineDiagrams(resolvedinexecutestage)

34

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5ID I1 I2 I3 - I5EX I1 I2 - - I5ME I1 I2 - - I5WB I1 I2 - - I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: BEQ +200 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 ID3 - - -(I4) 108: IF4 - - - -(I5) 304: ADD IF5 ID5 EX5 ME5 WB5

Resource Usage

- Þ pipeline bubble

Page 35: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

UseSimplerBranches:E.g.OnlyCompareOneRegisterAgainstZeroinIDStage

35

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

IF I1 I2 I3 I4 I5ID I1 I2 - I4 I5EX I1 I2 - I4 I5ME I1 I2 - I4 I5WB I1 I2 - I4 I5

timet0 t1 t2 t3 t4 t5 t6 t7 . . . .

(I1) 096: ADD IF1 ID1 EX1 ME1 WB1(I2) 100: BEQZ +200 IF2 ID2 EX2 ME2 WB2(I3) 104: ADD IF3 - - - -(I4) 300: ADD IF4 ID4 EX4 ME4 WB4

Resource Usage

- Þ pipeline bubble

Page 36: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

PipelinedMIPSDatapath

Adder

IF/ID

MemoryAccess

WriteBack

InstructionFetch

Instr. DecodeReg. Fetch

ExecuteAddr. Calc

ALU

Mem

ory

Reg File

MU

X

Data

Mem

ory

MU

X

SignExtend

Zero?

MEM

/WB

EX/M

EM

4

Adder

Next SEQ PC

RD RD RD

Next PC

Address

RS1

RS2

Imm

MU

X

ID/EX

(I1) 096: ADD(I2) 100: BEQZ +200 (I3) 104: ADD(I4) 300: ADD

36

Page 37: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ControlHazardDelaySummary

• JAL:unconditionaljumptoPC+immediate– 1cycledelayofpipeline• JALR:indirectjumptors1+immediate– 1cycledelay• Branch:if(rs1conds rs2),branchtoPC+immediate– 2cyclesdelay– 1cycledelayforsimplerbranch(BEQZ)withpipeline

improvement

37

Page 38: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ReducingControlFlowPenalty

• Softwaresolutions– Eliminatebranches- loopunrolling• Increasestherunlength

– Reduceresolutiontime- instructionscheduling• Computethebranchconditionasearlyaspossible(oflimitedvaluebecausebranchesoftenincriticalpaththroughcode)

• Hardwaresolutions– Findsomethingelsetodo- delayslots• Replacespipelinebubbleswithusefulwork(requiressoftwarecooperation)

– Speculate- branchprediction• Speculativeexecutionofinstructionsbeyondthebranch

38

Page 39: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

AdditionalMaterials– BranchPrediction

39

Page 40: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BranchPrediction

• Motivation– Branchpenaltieslimitperformanceofdeeplypipelinedprocessors– Modernbranchpredictorshavehighaccuracy– (>95%)andcanreducebranchpenaltiessignificantly

• Requiredhardwaresupport:– Predictionstructures:• Branchhistorytables,branchtargetbuffers,etc.

• Mispredict recoverymechanisms:– Keepresultcomputationseparatefromcommit– Killinstructionsfollowingbranchinpipeline– Restorestatetothatfollowingbranch

40

Page 41: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

StaticBranchPrediction

41

Overallprobabilityabranchistakenis~60-70%but:

ISAcanattachpreferreddirectionsemanticstobranches,e.g.,MotorolaMC88110

bne0 (preferredtaken) beq0 (nottaken)

backward90%

forward50%

WhatC++statementdoesthislooklike

WhatC++statementdoesthislooklike

Page 42: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

DynamicBranchPredictionlearningbasedonpastbehavior

• Temporalcorrelation(time)– IfItellyouthatacertainbranchwastakenlasttime,doesthishelp?

– Thewayabranchresolvesmaybeagoodpredictorofthewayitwillresolveatthenextexecution

• Spatialcorrelation(space)– Severalbranchesmayresolveinahighlycorrelatedmanner– Forinstance,apreferredpathofexecution

42

Page 43: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

DynamicBranchPrediction

• 1-bitpredictionscheme– Low-portionaddressasaddressforaone-bitflagforTakenor

NotTaken historically– Simple• 2-bitprediction– Misstwicetochange

43

Page 44: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BranchPredictionBits

• Assume2BPbitsperinstruction• Changethepredictionaftertwoconsecutivemistakes!

44

¬takewrongtaken

¬taken

taken

taken

taken¬takeright

takeright

takewrong

¬taken

¬taken¬taken

BPstate:(predict take/¬take)x(lastprediction right/wrong)

Page 45: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BranchHistoryTable

45

4K-entryBHT,2bits/entry,~80-90%correctpredictions

0 0FetchPC

Branch? TargetPC

+

I-Cache

Opcode offsetInstruction

kBHTIndex

2k-entryBHT,2bits/entry

Taken/¬Taken?

Page 46: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

ExploitingSpatialCorrelationYehandPatt,1992

46

Iffirstconditionfalse,secondconditionalsofalse

Historyregister,H,recordsthedirectionofthelastNbranchesexecutedbytheprocessor

if (x[i] < 7) theny += 1;

if (x[i] < 5) thenc -= 4;

Page 47: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

Two-LevelBranchPredictor

47

PentiumProusestheresultfromthelasttwobranchestoselectoneofthefoursetsofBHTbits(~95%correct)

0 0

kFetchPC

ShiftinTaken/¬Takenresultsofeachbranch

2-bitglobalbranchhistoryshiftregister

Taken/¬Taken?

Page 48: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

SpeculatingBothDirections• Analternativetobranchpredictionistoexecutebothdirectionsofabranchspeculatively– resourcerequirementisproportionaltothenumberofconcurrent

speculativeexecutions– onlyhalftheresourcesengageinusefulworkwhenbothdirections

ofabranchareexecutedspeculatively– branchpredictiontakeslessresourcesthanspeculativeexecution

ofbothpaths

• Withaccuratebranchprediction,itismorecosteffectivetodedicateallresourcestothepredicteddirection!– Whatwouldyouchoosewith80%accuracy?

48

Page 49: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

AreWeMissingSomething?

• Knowingwhetherabranchistakenornotisgreat,butwhatelsedoweneedtoknowaboutit?

Branchtargetaddress

49

Page 50: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BranchTargetBuffer

50

BPbitsarestoredwiththepredictedtargetaddress.

IFstage:If(BP=taken)thennPC=targetelsenPC=PC+4Later:checkprediction,ifwrongthenkilltheinstructionandupdateBTB&BPb elseupdateBPb

IMEM

PC

BranchTargetBuffer(2k entries)

k

BPbpredicted

target BP

target

Page 51: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

AddressCollisions(MisPrediction)

51

Whatwillbefetchedaftertheinstructionat1028?BTBprediction =Correcttarget =

=>

Assumea128-entryBTB

BPbtargettake236

1028Add.....

132Jump+104

InstructionMemory

2361032

kill PC=236andfetch PC=1032

Isthisacommonoccurrence?

Page 52: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BTBisonlyforControlInstructions

• Isevenbranchpredictionfastenoughtoavoidbubbles?• WhendoweindextheBTB?– i.e.,whatstateisthebranchin,inordertoavoidbubbles?

• BTBcontainsusefulinformationforbranchandjumpinstructionsonly=> Donotupdateitforotherinstructions

• ForallotherinstructionsthenextPCisPC+4!

• Howtoachievethiseffectwithoutdecodingtheinstruction?

52

Page 53: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

BranchTargetBuffer(BTB)

53

• KeepboththebranchPCandtargetPCintheBTB• PC+4isfetchedifmatchfails• Onlytaken branchesandjumpsheldinBTB• NextPCdeterminedbefore branchfetchedanddecoded

2k-entry direct-mapped BTB(can also be associative)

I-Cache PC

k

Valid

valid

EntryPC

=

match

predicted

target

targetPC

Page 54: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

CombiningBTBandBHT

• BTBentriesareconsiderablymoreexpensivethanBHT,butcanredirectfetchesatearlierstageinpipelineandcanaccelerateindirectbranches(JR)

• BHTcanholdmanymoreentriesandismoreaccurate

54

A PCGeneration/MuxP InstructionFetchStage1F InstructionFetchStage2B BranchAddressCalc/BeginDecodeI CompleteDecodeJ SteerInstructionstoFunctionalunitsR RegisterFileReadE IntegerExecute

BTB

BHTBHTinlaterpipelinestagecorrectswhenBTBmissesapredictedtakenbranch

BTB/BHTonlyupdatedafterbranchresolvesinEstage

Page 55: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

UsesofJumpRegister(JR)

• Switchstatements(jumptoaddressofmatchingcase)

• Dynamicfunctioncall(jumptorun-timefunctionaddress)

• Subroutinereturns(jumptoreturnaddress)

55

HowwelldoesBTBworkforeachofthesecases?

BTBworkswellifsamecaseusedrepeatedly

BTBworkswellifsamefunctionusuallycalled,(e.g.,inC++programming,whenobjectshavesametypeinvirtualfunctioncall)

BTBworkswellifusuallyreturntothesameplaceÞ Oftenonefunctioncalledfrommanydistinctcallsites!

Page 56: Lecture 08: RISC-V Pipeline Implementation€¦ · Lecture 08: RISC-V Pipeline Implementation CSCE 513 Computer Architecture Department of Computer Science and Engineering Yonghong

SubroutineReturnStack

SmallstructuretoaccelerateJRforsubroutinereturns,typicallymuchmoreaccuratethanBTBs.

56

&fb()

&fc()

Pushcalladdresswhenfunctioncallexecuted

Popreturnaddresswhensubroutinereturndecoded

fa() { fb(); }fb() { fc(); }fc() { fd(); }

&fd() kentries(typicallyk=8-16)