CS 152 Computer Architecture and Engineering
Lecture 3 - From CISC to RISC

John Wawrzynek
Electrical Engineering and Computer Sciences
University of California at Berkeley

http://www.eecs.berkeley.edu/~johnw
http://inst.eecs.berkeley.edu/~cs152

9/1/2016 CS152, Fall 2016



Last Time in Lecture 2

§ ISA is the hardware/software interface
– Defines set of programmer-visible state
– Defines instruction format (bit encoding) and instruction semantics
– Examples: IBM 360, MIPS, RISC-V, x86, JVM

§ Many possible implementations of one ISA
– 360 implementations: model 30 (c. 1964), z12 (c. 2012)
– x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486, Pentium, Pentium Pro, Pentium-4 (c. 2000), Core 2 Duo, Nehalem, Sandy Bridge, Ivy Bridge, Atom, AMD Athlon, Transmeta Crusoe, SoftPC
– MIPS implementations: R2000, R4000, R10000, R18K, …
– JVM: HotSpot, PicoJava, ARM Jazelle, …

§ Microcoding: a straightforward, methodical way to implement machines using a low logic gate count; it also simplifies the implementation of complex instructions

“Iron Law” of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Clock Cycle}}$$

§ Instructions per program depends on compiler technology and ISA
§ Cycles per instruction (CPI) depends on ISA and µarchitecture
§ Time per clock cycle depends upon the µarchitecture and base technology

Microarchitecture           CPI    Cycle time
Microcoded                  >1     short
Single-cycle unpipelined    1      long
Pipelined                   ~1     short

This lecture
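The Iron Law can be evaluated directly once the three factors are known. The sketch below is only an illustration: the function name `execution_time` and every number in it (instruction count, CPI values, cycle times) are assumptions chosen to mirror the three table rows, not figures from the lecture.

```python
def execution_time(instructions, cpi, cycle_time_ns):
    """Iron Law: time/program = instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * cycle_time_ns

# Hypothetical numbers, chosen only to show the trade-off between CPI and cycle time.
N = 1_000_000                                                   # instructions (assumed)
microcoded = execution_time(N, cpi=7.0, cycle_time_ns=10.0)     # CPI > 1, short cycle
single_cycle = execution_time(N, cpi=1.0, cycle_time_ns=50.0)   # CPI = 1, long cycle
pipelined = execution_time(N, cpi=1.2, cycle_time_ns=10.0)      # CPI ~ 1, short cycle

for name, t_ns in [("microcoded", microcoded),
                   ("single-cycle", single_cycle),
                   ("pipelined", pipelined)]:
    print(f"{name:12s} {t_ns / 1e6:.1f} ms")
```

With these assumed inputs the same program takes a different time on each style of machine even though the instruction count is identical, which is the point of separating the three factors.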

Hardware Elements

§ Combinational circuits
– Mux, Decoder, ALU, ...

• Synchronous state elements
– Flip-flop, Register, Register file, SRAM, DRAM
– Edge-triggered: data is sampled at the rising edge

ALU OpSelect:
- Add, Sub, ...
- And, Or, Xor, Not, ...
- GT, LT, EQ, Zero, ...

[Figure: block symbols for a mux (inputs A0, A1, ..., An-1, a lg(n)-bit Sel, output O), a decoder (lg(n)-bit input A, outputs O0, O1, ..., On-1), an ALU (inputs A and B, OpSelect, outputs Result and Comp?), and an edge-triggered flip-flop ff (D, Q, Clk, En).]

Register Files

§ Reads are combinational

[Figure: a register built from n flip-flops (inputs D0, D1, D2, ..., Dn-1; outputs Q0, Q1, Q2, ..., Qn-1) sharing Clk and En, and a 2R+1W register file with ports ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), ReadData1 (rd1), ReadData2 (rd2), WriteData (wd), WE (we), and Clock.]

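Register File Implementation

§ RISC-V integer instructions have at most 2 register source operands

[Figure: register file built from registers reg 0, reg 1, ..., reg 31 with clk, we, 32-bit wdata, 5-bit rd, rs1, and rs2 specifiers, and 32-bit rdata1 and rdata2 outputs. The write side generates per-register enables; the read side uses rs1 and rs2 as selects.]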
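A behavioral sketch of the 2R+1W register file described above may help. It is a software model under stated assumptions, not the lecture's hardware: the class name `RegisterFile` and the method names are invented for illustration, and register 0 is treated as an ordinary register here.

```python
class RegisterFile:
    """2R+1W register file: combinational reads, write on the rising clock edge."""

    def __init__(self, num_regs=32, width=32):
        self.mask = (1 << width) - 1
        self.regs = [0] * num_regs

    def read(self, rs1, rs2):
        # Reads are combinational: outputs follow the select inputs immediately.
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        # The write happens only at the clock edge, and only if we is asserted.
        if we:
            self.regs[ws] = wd & self.mask


rf = RegisterFile()
rf.clock_edge(we=True, ws=5, wd=0x1234)
rf.clock_edge(we=False, ws=6, wd=0xFFFF)   # ignored: write enable is low
print(rf.read(5, 6))                        # (4660, 0)
```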

A Simple Memory Model

[Figure: MAGIC RAM block with inputs Address, WriteData, WriteEnable, and Clock, and output ReadData.]

Reads and writes are always completed in one cycle
• a Read can be done any time (i.e., combinational)
• a Write is performed at the rising clock edge iff the WriteEnable signal is asserted
⇒ the write address and data must be stable at the clock edge

Later in the course we will present a more realistic model of memory
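The idealized memory can be modeled in a few lines. This is a minimal sketch, assuming sparse storage in which unwritten addresses read as 0; the class name `MagicRAM` and its method names are illustrative only.

```python
class MagicRAM:
    """Idealized memory: combinational read, write at the rising clock edge."""

    def __init__(self):
        self.cells = {}   # sparse storage; unwritten addresses read as 0 (assumption)

    def read(self, address):
        # A read can be done at any time, with no clock involved.
        return self.cells.get(address, 0)

    def rising_edge(self, address, write_data, write_enable):
        # Address and data are sampled at the edge, so they must be stable then.
        if write_enable:
            self.cells[address] = write_data


mem = MagicRAM()
mem.rising_edge(0x100, 42, write_enable=True)
mem.rising_edge(0x104, 99, write_enable=False)   # no write: WriteEnable deasserted
```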

Implementing RISC-V

Single-cycle per instruction datapath & control logic

(Similar to MIPS single-cycle processor in CS 61C)

Instruction Execution Review

Execution of an instruction involves

1. Instruction fetch
2. Decode and register fetch
3. ALU operation
4. Memory operation (optional)
5. Write back (optional)

and compute address of next instruction

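Datapath: Reg-Reg ALU Instructions

Bits:     31-27   26-22   21-17   16-7    6-0
Field:    rd      rs1     rs2     func    opcode
Width:    5       5       5       10      7

rd ← (rs1) func (rs2)

[Figure: datapath. The PC addresses Inst. Memory and an adder computes PC + 0x4. Inst<26:22> and Inst<21:17> drive the GPR read selects rs1 and rs2, Inst<31:27> drives the write address wa, and Inst<16:0> feeds ALU Control, which sets the ALU operation (OpCode) and RegWriteEn (we). rd1 and rd2 feed the ALU, and the ALU result returns to wd.]

RegWrite Timing?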
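The field boundaries above translate directly into shifts and masks. The following sketch assumes only the bit positions listed for the Reg-Reg format; the helper names `bits` and `decode_reg_reg` are made up, and the example word uses arbitrary field values rather than a real RISC-V encoding.

```python
def bits(word, hi, lo):
    """Return word<hi:lo> as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

def decode_reg_reg(inst):
    """Split a 32-bit instruction using the Reg-Reg layout from the slide."""
    return {
        "rd":     bits(inst, 31, 27),
        "rs1":    bits(inst, 26, 22),
        "rs2":    bits(inst, 21, 17),
        "func":   bits(inst, 16, 7),
        "opcode": bits(inst, 6, 0),
    }

# Build a word from arbitrary field values (not a real encoding) and decode it.
inst = (3 << 27) | (1 << 22) | (2 << 17) | (0x155 << 7) | 0x33
fields = decode_reg_reg(inst)
```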

Datapath: Reg-Imm ALU Instructions

Bits:     31-27   26-22   21-10          9-7     6-0
Field:    rd      rs1     immediate12    func    opcode
Width:    5       5       12             3       7

rd ← (rs1) op immediate

[Figure: datapath. The PC addresses Inst. Memory and an adder computes PC + 0x4. inst<26:22> drives rs1, inst<31:27> drives wa, inst<21:10> passes through ImmSelect (controlled by ImmSel) to the ALU, and inst<9:0> feeds ALU Control, which sets the ALU operation (OpCode) and RegWriteEn (we).]

Conflicts in Merging Datapath

Reg-Imm:   rd   rs1   immediate12   func3   opcode          rd ← (rs1) op immediate
Reg-Reg:   rd   rs1   rs2   func10   opcode                 rd ← (rs1) func (rs2)
           (Reg-Reg field widths: 5, 5, 5, 10, 7)

[Figure: the Reg-Reg and Reg-Imm datapaths overlaid. Two sources now compete for the second ALU input (rd2, selected by Inst<21:17>, and the ImmSelect output, from Inst<21:10>), and ALU Control receives both Inst<16:0> and Inst<9:0>.]

Introduce muxes

Datapath for ALU Instructions

Reg-Imm:   rd   rs1   immediate12   func3   opcode          rd ← (rs1) op immediate
Reg-Reg:   rd   rs1   rs2   func10   opcode                 rd ← (rs1) func (rs2)
           (Reg-Reg field widths: 5, 5, 5, 10, 7)

[Figure: merged datapath. An Op2Sel mux (Reg / Imm) selects between rd2 and the ImmSelect output as the second ALU operand. <26:22> and <21:17> drive rs1 and rs2, <31:27> drives wa, <6:0> is the OpCode, and <16:0> feeds ALU Control. Control signals: ImmSel, Op2Sel, FuncSel, RegWriteEn.]
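The role of the Op2Sel mux in the merged ALU datapath can be sketched in software. This is a minimal model assuming a 32-bit datapath and the Reg-Imm layout with immediate12 in inst<21:10>; the function names and the string values "Reg" and "Imm" are illustrative, and only an add function is modeled.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value

def alu_operand2(inst, rd2, op2sel):
    """Op2Sel mux: choose the register value or the sign-extended immediate12."""
    if op2sel == "Reg":
        return rd2
    imm12 = (inst >> 10) & 0xFFF              # inst<21:10>
    return sign_extend(imm12, 12) & MASK32

def alu_add(a, b):
    """One ALU function (add), wrapped to 32 bits."""
    return (a + b) & MASK32

# Reg-Imm instruction word with immediate12 = -1 (0xFFF); other fields left zero.
inst = 0xFFF << 10
```

Calling `alu_add(5, alu_operand2(inst, 7, "Imm"))` adds the immediate -1 and ignores the register value 7, while selecting "Reg" passes the register value through unchanged.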

Load/Store Instructions

Load:    rd    rs1   immediate12   func3   opcode
Store:   imm   rs1   rs2   imm   func3   opcode      (field widths: 5, 5, 5, 7, 3, 7)

Addressing mode: (rs1) + displacement

rs1 is the base register
rd is the destination of a Load, rs2 is the data source for a Store

[Figure: ALU datapath extended with a Data Memory (addr, wdata, rdata, we). The ALU adds the “base” register value and the displacement (disp) from ImmSelect to form addr, MemWrite drives we, and a WBSel mux (ALU / Mem) chooses the value written back to the GPRs. Control signals: ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, RegWriteEn.]

RISC-V Conditional Branches

Bits:     31-27        26-22   21-17   16-10       9-7     6-0
Field:    imm[11:7]    rs1     rs2     imm[6:0]    func3   opcode
Width:    5            5       5       7           3       7

Instructions: BEQ/BNE, BLT/BGE, BLTU/BGEU

§ Compare two integer registers for equality (BEQ/BNE) or for relative value, either signed (BLT/BGE) or unsigned (BLTU/BGEU)

§ 12-bit immediate encodes the branch target address as a signed offset from the PC, in units of 16 bits (i.e., shift left by 1, then add to PC).
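The branch target computation can be sketched as follows, assuming the field positions listed for this format (imm[11:7] in inst<31:27>, imm[6:0] in inst<16:10>) and a 32-bit PC. The function names are invented for illustration, and `encode_branch_imm` exists only to build test words.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value

def branch_imm12(inst):
    """Reassemble the split immediate: imm[11:7] = inst<31:27>, imm[6:0] = inst<16:10>."""
    hi = (inst >> 27) & 0x1F
    lo = (inst >> 10) & 0x7F
    return (hi << 7) | lo

def branch_target(pc, inst):
    """Target = PC + (sign-extended imm12 << 1), wrapped to 32 bits."""
    offset = sign_extend(branch_imm12(inst), 12) << 1
    return (pc + offset) & 0xFFFFFFFF

def encode_branch_imm(imm12):
    """Demo helper: place a 12-bit immediate into the two instruction fields."""
    imm12 &= 0xFFF
    return ((imm12 >> 7) << 27) | ((imm12 & 0x7F) << 10)
```

For example, an immediate of 4 moves the PC forward by 8 bytes, and an immediate of -2 moves it back by 4 bytes, because the offset counts 16-bit units.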

Conditional Branches (BEQ/BNE/BLT/BGE/BLTU/BGEU)

[Figure: load/store datapath extended for branches. A PCSel mux chooses the next PC between pc+4 and br, with a second adder (Add) producing br. Br Logic takes the Bcomp? comparison result. Control signals: PCSel, ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, RegWrEn.]

Including Jump and Jalr

[Figure: datapath extended for jumps. The PCSel mux selects among pc+4, br, rind, and jabs; a WASel mux selects the register write address (rd or register 1); Br Logic takes Bcomp?. Control signals: PCSel, WASel, ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, RegWriteEn.]

Hardwired Control is pure Combinational Logic

[Figure: a combinational logic block. Inputs: opcode, Equal?. Outputs: ImmSel, Op2Sel, FuncSel, MemWrite, WBSel, WASel, RegWriteEn, PCSel.]

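ALU Control & Immediate Extension

[Figure: Inst<6:0> (Opcode) feeds a Decode Map that produces ALUop. FuncSel (Func, Op, +) chooses the ALU function from Inst<16:7> (Func), the decoded ALUop, or a fixed add (+). ImmSel (IType12, BsType12, BrType12) chooses the immediate format.]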

Hardwired Control Table

Opcode     ImmSel     Op2Sel   FuncSel   MemWr   RFWen   WBSel   WASel   PCSel
ALU        *          Reg      Func      no      yes     ALU     rd      pc+4
ALUi       IType12    Imm      Op        no      yes     ALU     rd      pc+4
LW         IType12    Imm      +         no      yes     Mem     rd      pc+4
SW         BsType12   Imm      +         yes     no      *       *       pc+4
BEQtrue    BrType12   *        *         no      no      *       *       br
BEQfalse   BrType12   *        *         no      no      *       *       pc+4
J          *          *        *         no      no      *       *       jabs
JAL        *          *        *         no      yes     PC      X1      jabs
JALR       *          *        *         no      yes     PC      rd      rind

Op2Sel = Reg / Imm
WBSel = ALU / Mem / PC
WASel = rd / X1
PCSel = pc+4 / br / rind / jabs
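Because hardwired control is purely combinational, it can be modeled as a lookup from opcode (plus the branch comparison result) to control signals. The sketch below transcribes the control table's nine rows into a dictionary; the names `CONTROL`, `SIGNALS`, and `control_signals`, and the use of the string "BEQ" as a combined key, are assumptions made for illustration.

```python
# Columns: ImmSel, Op2Sel, FuncSel, MemWr, RFWen, WBSel, WASel, PCSel ("*" = don't care)
CONTROL = {
    "ALU":      ("*",        "Reg", "Func", False, True,  "ALU", "rd", "pc+4"),
    "ALUi":     ("IType12",  "Imm", "Op",   False, True,  "ALU", "rd", "pc+4"),
    "LW":       ("IType12",  "Imm", "+",    False, True,  "Mem", "rd", "pc+4"),
    "SW":       ("BsType12", "Imm", "+",    True,  False, "*",   "*",  "pc+4"),
    "BEQtrue":  ("BrType12", "*",   "*",    False, False, "*",   "*",  "br"),
    "BEQfalse": ("BrType12", "*",   "*",    False, False, "*",   "*",  "pc+4"),
    "J":        ("*",        "*",   "*",    False, False, "*",   "*",  "jabs"),
    "JAL":      ("*",        "*",   "*",    False, True,  "PC",  "X1", "jabs"),
    "JALR":     ("*",        "*",   "*",    False, True,  "PC",  "rd", "rind"),
}
SIGNALS = ("ImmSel", "Op2Sel", "FuncSel", "MemWr", "RFWen", "WBSel", "WASel", "PCSel")

def control_signals(opcode, equal=False):
    """Combinational control: outputs depend only on the opcode and the Equal? input."""
    key = opcode
    if opcode == "BEQ":
        key = "BEQtrue" if equal else "BEQfalse"
    return dict(zip(SIGNALS, CONTROL[key]))
```

For instance, `control_signals("BEQ", equal=True)["PCSel"]` yields "br", while the same opcode with `equal=False` falls through to "pc+4".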

RISC-V Unconditional Jumps

Bits:     31-7                6-0
Field:    JumpOffset[24:0]    opcode
Width:    25                  7

Instructions: J, JAL

§ 25-bit immediate encodes the jump target address as a signed offset from the PC, in units of 16 bits (i.e., shift left by 1, then add to PC). A signed 25-bit count of 2-byte units reaches +/- 32 MB.

§ JAL is a subroutine call that also saves the return address (PC+4) in register x1

RISC-V Register Indirect Jumps

Bits:     31-27   26-22   21-10        9-7     6-0
Field:    rd      rs1     Imm[11:0]    func3   opcode
Width:    5       5       12           3       7

Instructions: JALR, RDNPC

§ Jumps to the target address given by adding the 12-bit offset (not shifted by 1 bit) to register rs1. PC ← RF[rs1] + sign-ext(Imm)

§ The return address (PC+4) is written to rd (can be x0 if the value is not needed)

§ The RDNPC instruction simply writes the return address to register rd without jumping (used for dynamic linking)
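The JALR semantics can be written out as a small state update. This is a sketch under two assumptions that are labeled in the code: a 32-bit PC, and writes to x0 being discarded. The function name `jalr` and its argument order are illustrative.

```python
def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value

def jalr(pc, regs, rd, rs1, imm12):
    """Return (next_pc, new_regs): PC <- RF[rs1] + sign-ext(Imm), rd <- PC + 4."""
    # The 12-bit offset is NOT shifted left, unlike branch and jump offsets.
    target = (regs[rs1] + sign_extend(imm12, 12)) & 0xFFFFFFFF   # 32-bit PC assumed
    new_regs = list(regs)
    if rd != 0:                       # assumption: writes to x0 are discarded
        new_regs[rd] = (pc + 4) & 0xFFFFFFFF
    return target, new_regs

regs = [0] * 32
regs[5] = 0x2000
next_pc, regs_after = jalr(pc=0x100, regs=regs, rd=1, rs1=5, imm12=0xFFC)  # imm = -4
```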

Full RISC-V 1-Stage Datapath (Lab 1)

Note: Reg File shown twice for clarity. Immediate select changed.

Single-Cycle Hardwired Control

We will assume the clock period is sufficiently long for all of the following steps to be “completed”:
1. Instruction fetch
2. Decode and register fetch
3. ALU operation
4. Data fetch if required
5. Register write-back setup time

⇒ $$t_C > t_{IFetch} + t_{RFetch} + t_{ALU} + t_{DMem} + t_{RWB}$$

At the rising edge of the following clock, the PC, register file and memory are updated
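The clock-period constraint is a simple sum over the longest path. In the sketch below every delay value is hypothetical (the lecture gives none), and the names `delays_ns` and `min_clock_period` are invented for illustration.

```python
# Hypothetical component delays in nanoseconds; none of these come from the lecture.
delays_ns = {
    "t_IFetch": 2.0,   # instruction memory read
    "t_RFetch": 1.0,   # decode and register file read
    "t_ALU":    2.0,   # ALU operation
    "t_DMem":   2.0,   # data memory read (needed by loads)
    "t_RWB":    0.5,   # register write-back setup time
}

def min_clock_period(delays):
    """Lower bound on t_C: the sum of the delays along the longest path."""
    return sum(delays.values())

t_c = min_clock_period(delays_ns)   # 7.5 ns with the assumed delays
f_max_mhz = 1e3 / t_c               # period in ns -> frequency in MHz
```

Every instruction pays this full period in a single-cycle design, including instructions that never touch data memory.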

“Iron Law” of Processor Performance

$$\frac{\text{Time}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Time}}{\text{Cycle}}$$

§ Instructions per program depends on source code, compiler technology, and ISA
§ Cycles per instruction (CPI) depends on ISA and µarchitecture
§ Time per cycle depends upon the µarchitecture and base technology

CPI for Microcoded Machine

[Figure: timeline along the Time axis. Inst 1 takes 7 cycles, Inst 2 takes 5 cycles, Inst 3 takes 10 cycles.]

Total clock cycles = 7 + 5 + 10 = 22
Total instructions = 3
CPI = 22/3 = 7.33

CPI is always an average over a large number of instructions.
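The CPI calculation above is an average over per-instruction cycle counts, which a short sketch makes explicit. The function name `average_cpi` is illustrative; the three cycle counts are the ones from the slide.

```python
def average_cpi(cycles_per_instruction):
    """CPI = total clock cycles / total instructions executed."""
    return sum(cycles_per_instruction) / len(cycles_per_instruction)

cpi = average_cpi([7, 5, 10])   # the three instructions from the slide
print(f"CPI = {cpi:.2f}")       # CPI = 7.33
```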

Technology Influence

§ When microcode appeared in the 50s, different technologies were used for:
– Logic: vacuum tubes
– Main memory: magnetic cores
– Read-only memory: diode matrix, punched metal cards, …

§ Logic very expensive compared to ROM or RAM
§ ROM cheaper than RAM
§ ROM much faster than RAM

But the seventies brought advances in integrated circuit technology and semiconductor memory…

First Microprocessor: Intel 4004, 1971

§ 4-bit accumulator architecture
§ 8 µm pMOS
§ 2,300 transistors
§ 3 x 4 mm²
§ 750 kHz clock
§ 8-16 cycles/inst.

Made possible by new integrated circuit technology

Microprocessors in the Seventies

§ Initial target was embedded control
– First micro, 4-bit 4004 from Intel, designed for a desktop printing calculator
– Constrained by what could fit on a single chip
– Accumulator architectures, similar to earliest computers
– Hardwired state machine control

§ 8-bit micros (8085, 6800, 6502) used in hobbyist personal computers
– Micral, Altair, TRS-80, Apple-II
– Usually had 16-bit address space (up to 64 KB directly addressable)
– Often came with a simple BASIC language interpreter built into ROM or loaded from cassette tape.

VisiCalc – the first “killer” app for micros

• Microprocessors had little impact on the conventional computer market until the VisiCalc spreadsheet for the Apple-II
• Apple-II used a MOS Technology 6502 microprocessor running at 1 MHz

[Personal Computing Ad, 1979]

Floppy disks were originally invented by IBM as a way of shipping IBM 360 microcode patches to customers!

DRAM in the Seventies

§ Dramatic progress in semiconductor memory technology
§ 1970, Intel introduces first DRAM, the 1 Kbit 1103
§ 1979, Fujitsu introduces 64 Kbit DRAM

=> By the mid-Seventies, obvious that PCs would soon have > 64 KBytes of physical memory

Microprocessor Evolution

§ Rapid progress in the 70s, fueled by advances in MOSFET technology and expanding markets

§ Intel i432
– Most ambitious seventies’ micro; started in 1975, released 1981
– 32-bit capability-based object-oriented architecture
– Instructions a variable number of bits long
– Severe performance, complexity, and usability problems

§ Motorola 68000 (1979, 8 MHz, 68,000 transistors)
– Heavily microcoded (and nanocoded)
– 32-bit general-purpose register architecture (24 address pins)
– 8 address registers, 8 data registers

§ Intel 8086 (1978, 8 MHz, 29,000 transistors)
– “Stopgap” 16-bit processor, architected in 10 weeks
– Extended accumulator architecture, assembly-compatible with 8080
– 20-bit addressing through segmented addressing scheme

IBM PC, 1981

§ Hardware
– Team from IBM building PC prototypes in 1979
– Motorola 68000 chosen initially, but 68000 was late
– IBM builds “stopgap” prototypes using 8088 boards from the DisplayWriter word processor
– 8088 is the 8-bit bus version of the 8086 => allows cheaper system
– Estimated sales of 250,000
– 100,000,000s sold

§ Software
– Microsoft negotiates to provide OS for IBM. Later buys and modifies QDOS from Seattle Computer Products.

§ Open System
– Standard processor, Intel 8088
– Standard interfaces
– Standard OS, MS-DOS
– IBM permits cloning and third-party software

[Personal Computing Ad, 11/81]

Microprogramming: early Eighties

§ Evolution bred more complex micro-machines
– Complex instruction sets led to need for subroutine and call stacks in µcode
– Need for fixing bugs in control programs was in conflict with read-only nature of µROM
– ⇒ Writable Control Store (WCS) (B1700, QMachine, Intel i432, …)

§ With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid → more complexity

§ Better compilers made complex instructions less important.

§ Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive

Analyzing Microcoded Machines

§ John Cocke and group at IBM
– Working on a simple pipelined processor, 801, and advanced compilers inside IBM
– Ported experimental PL.8 compiler to IBM 370, and only used simple register-register and load/store instructions similar to 801
– Code ran faster than other existing compilers that used all 370 instructions! (up to 6 MIPS, whereas 2 MIPS considered good before)

§ Emer, Clark, at DEC
– Measured VAX-11/780 using external hardware
– Found it was actually a 0.5 MIPS machine, although usually assumed to be a 1 MIPS machine
– Found 20% of VAX instructions responsible for 60% of microcode, but only account for 0.2% of execution time!

§ VAX 8800
– Control Store: 16K*147b RAM, Unified Cache: 64K*8b RAM
– 4.5x more microstore RAM than cache RAM!
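The VAX 8800 comparison can be checked with a few lines of arithmetic. The only assumption here is that "K" means 1024 entries; the variable names are illustrative.

```python
K = 1024                              # assumes "K" means 2**10 entries

control_store_bits = 16 * K * 147     # 16K words x 147 bits
cache_bits = 64 * K * 8               # 64K words x 8 bits

ratio = control_store_bits / cache_bits
print(control_store_bits, cache_bits, round(ratio, 2))   # 2408448 524288 4.59
```

The ratio comes out near 4.6, consistent with the roughly 4.5x figure quoted for microstore RAM versus cache RAM.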

IC Technology Changes Tradeoffs

§ Logic, RAM, ROM all implemented using MOS transistors
§ Semiconductor RAM ~ same speed as ROM

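Nanocoding

Exploits recurring control signal patterns in µcode, e.g.,

ALU0    A ← Reg[rs1] ...
ALUi0   A ← Reg[rs1] ...

[Figure: the µPC (state) supplies the µaddress to the µcode ROM, which outputs the µcode next-state and a nanoaddress; the nanoaddress indexes the nanoinstruction ROM, which outputs the data (control signals).]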
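The storage saving from nanocoding can be estimated by comparing a one-level ROM with the two-level arrangement. The sketch below is a rough model with entirely hypothetical sizes (number of states, number of distinct control patterns, control word width); the function names are invented, and real designs differ in detail.

```python
def flat_rom_bits(num_states, control_width, next_state_width):
    """One-level microcode: every state stores a full-width control word."""
    return num_states * (control_width + next_state_width)

def nanocoded_rom_bits(num_states, unique_patterns, control_width, next_state_width):
    """Two-level: each state stores a short nanoaddress; wide words are stored once."""
    nano_addr_width = max(1, (unique_patterns - 1).bit_length())
    ucode_rom = num_states * (nano_addr_width + next_state_width)
    nano_rom = unique_patterns * control_width
    return ucode_rom + nano_rom

# Hypothetical sizes: 1024 states, 64 distinct control patterns, 100 control bits,
# and a 10-bit next-state field.
flat = flat_rom_bits(1024, 100, 10)
nano = nanocoded_rom_bits(1024, 64, 100, 10)
```

With these assumed sizes the two-level organization needs about a fifth of the bits, and the saving grows as more states share the same control pattern.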

From CISC to RISC

§ Use fast RAM to build fast instruction cache of user-visible instructions, not fixed hardware microroutines
– Contents of fast instruction memory change to fit what application needs right now

§ Use simple ISA to enable hardwired pipelined implementation
– Most compiled code only used a few of the available CISC instructions
– Simpler encoding allowed pipelined implementations

§ Further benefit with integration
– In early ‘80s, could finally fit 32-bit datapath + small caches on a single chip
– No chip crossings in common case allows faster operation

Berkeley RISC Chips

RISC-I (1982) contains 44,420 transistors, fabbed in 5 µm NMOS, with a die area of 77 mm², ran at 1 MHz. This chip is probably the first VLSI RISC.

RISC-II (1983) contains 40,760 transistors, was fabbed in 3 µm NMOS, ran at 3 MHz, and the size is 60 mm².

Stanford built some too…

Summary

§ Microcoding became less attractive as gap between RAM and ROM speeds reduced, and logic implemented in same technology as memory

§ Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew

§ Iron Law explains architecture design space
– Trade instructions/program, cycles/instruction, and time/cycle

§ Load-Store RISC ISAs designed for efficient pipelined implementations
– Very similar to vertical microcode
– Inspired by earlier Cray machines (CDC 6600/7600)

§ RISC-V ISA will be used in lectures, problems, and labs
– Berkeley RISC chips: RISC-I, RISC-II, SOAR (RISC-III), SPUR (RISC-IV)

Acknowledgements

§ These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)

§ MIT material derived from course 6.823
§ UCB material derived from course CS252