3/30/2016 CS152, Spring 2016
CS 152 Computer Architecture and Engineering
Lecture 15: Vector Computers
Dr. George Michelogiannakis
EECS, University of California at Berkeley
CRD, Lawrence Berkeley National Laboratory
http://inst.eecs.berkeley.edu/~cs152

Last Time, Lecture 14: Multithreading
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; legend: Thread 1 through Thread 5, idle slot]

Question of the Day
§ Can Vector and VLIW combine?

Supercomputers
§ Definition of a supercomputer:
§ Fastest machine in world at given task
  – Performs at or near the currently highest operational rate for computers
§ A device to turn a compute-bound problem into an I/O-bound problem
§ Any machine costing $30M+
§ Any machine designed by Seymour Cray
§ CDC 6600 (Cray, 1964) regarded as first supercomputer

CDC 6600, Seymour Cray, 1963
§ A fast pipelined machine with 60-bit words
  – 128 Kword main memory capacity, 32 banks
§ Ten functional units (parallel, unpipelined)
  – Floating Point: adder, 2 multipliers, divider
  – Integer: adder, 2 incrementers, ...
§ Hardwired control (no microcoding)
§ Scoreboard for dynamic scheduling of instructions
§ Ten Peripheral Processors for Input/Output
  – a fast multi-threaded 12-bit integer ALU
§ Very fast clock, 10 MHz (FP add in 4 clocks)
§ >400,000 transistors, 750 sq. ft., 5 tons, 150 kW, novel freon-based technology for cooling
§ Fastest machine in world for 5 years (until 7600)
  – over 100 sold ($7-10M each)

IBM Memo on CDC 6600
Thomas Watson Jr., IBM CEO, August 1963:
“Last week, Control Data ... announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers... Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world's most powerful computer.”
To which Cray replied: “It seems like Mr. Watson has answered his own question.”

Top 500 Systems
LINPACK & LAPACK: Software libraries for performing linear algebra

Oak Ridge Titan
§ 560,640 cores
§ LINPACK performance 17,590 TFlop/s
§ Theoretical peak 27,112.5 TFlop/s
§ 8,209.00 kW
§ 710,144 GB
§ Opteron 6274 16C 2.2 GHz

NERSC (LBNL) Cori
§ Cray XC40 supercomputer
§ [email protected]/sec
§ 1,630 compute nodes, 52,160 cores in total
§ Cray Aries high-speed interconnect with Dragonfly topology as on Edison (0.25 μs to 3.7 μs MPI latency, ~8 GB/sec MPI bandwidth)
§ Aggregate memory: 203 TB
§ Scratch storage capacity: 30 PB

CDC 6600: A Load/Store Architecture
• Separate instructions to manipulate three types of registers:
    8 60-bit data registers (X)
    8 18-bit address registers (A)
    8 18-bit index registers (B)
• All arithmetic and logic instructions are reg-to-reg
    Format: opcode (6 bits), i (3), j (3), k (3)         Ri ← (Rj) op (Rk)
• Only Load and Store instructions refer to memory
    Format: opcode (6 bits), i (3), j (3), disp (18)     Ri ← M[(Rj) + disp]
    Touching address registers 1 to 5 initiates a load, 6 to 7 initiates a store
    - very useful for vector operations

CDC 6600: Datapath
[Figure: CDC 6600 datapath. Address registers (8 x 18-bit), index registers (8 x 18-bit), operand registers (8 x 60-bit), instruction stack (8 x 60-bit) feeding the IR, 10 functional units, and central memory (128 Kwords, 32 banks, 1 µs cycle), connected by operand, operand address, result, and result address paths]

CDC 6600 ISA designed to simplify high-performance implementation
§ Use of three-address, register-register ALU instructions simplifies pipelined implementation
  – No implicit dependencies between inputs and outputs
§ Decoupling setting of address register (Ar) from retrieving value from data register (Xr) simplifies providing multiple outstanding memory accesses
  – Software can schedule load of address register before use of value
  – Can interleave independent instructions in between
§ CDC 6600 has multiple parallel but unpipelined functional units
  – E.g., 2 separate multipliers
§ Follow-on machine CDC 7600 used pipelined functional units
  – Foreshadows later RISC designs

CDC 6600: Vector Addition
      B0 <- -n
loop: JZE B0, exit
      A0 <- B0 + a0    load X0
      A1 <- B0 + b0    load X1
      X6 <- X0 + X1
      A6 <- B0 + c0    store X6
      B0 <- B0 + 1
      jump loop
Ai = address register
Bi = index register
Xi = data register
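
The CDC 6600 loop above can be read as the following C sketch. It assumes that a0, b0, and c0 denote addresses one element past the end of each array, so that an index running from -n up to 0 needs only a compare against zero; that reading, the function name, and the parameter names are assumptions made for illustration, not something the slide states.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C rendering of the loop. a_end, b_end and c_end are assumed
 * to point one element past the end of arrays of length n. */
void vector_add_neg_index(const double *a_end, const double *b_end,
                          double *c_end, ptrdiff_t n)
{
    assert(n >= 0);
    for (ptrdiff_t i = -n; i != 0; i++) {   /* B0 <- -n ; JZE B0, exit   */
        double x0 = a_end[i];               /* A0 <- B0 + a0 : load X0   */
        double x1 = b_end[i];               /* A1 <- B0 + b0 : load X1   */
        c_end[i] = x0 + x1;                 /* X6 <- X0 + X1 ; store X6  */
    }                                       /* B0 <- B0 + 1 ; jump loop  */
}
```

A caller would pass `a + n`, `b + n`, and `c + n` for arrays of length `n`.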

Supercomputer Applications
§ Typical application areas
  – Military research (nuclear weapons, cryptography)
  – Scientific research
  – Weather forecasting
  – Oil exploration
  – Industrial design (car crash simulation)
  – Bioinformatics
  – Cryptography
§ All involve huge computations on large data sets
§ In 70s-80s, Supercomputer ≡ Vector Machine

VLIW vs Vector
§ VLIW takes advantage of instruction-level parallelism (ILP) by specifying instructions to execute in parallel
  [Figure: VLIW instruction word with slots Int Op 1, Int Op 2, Mem Op 1, Mem Op 2, FP Op 1, FP Op 2]
§ Vector architectures perform the same operation on multiple data elements
  – Data-level parallelism
  [Figure: vector arithmetic instruction ADDV v3, v1, v2 adds elements [0], [1], ..., [VLR-1] of v1 and v2 into v3]

Vector Programming Model
[Figure: scalar registers r0 to r15; vector registers v0 to v15, each holding elements [0], [1], [2], ..., [VLRMAX-1]; vector length register VLR]
[Figure: vector arithmetic instructions: ADDV v3, v1, v2 adds elements [0], [1], ..., [VLR-1] of v1 and v2 into v3]
[Figure: vector load and store instructions: LV v1, r1, r2 loads vector register v1 from memory using base r1 and stride r2]

Control Information
§ VLR limits the highest vector element to be processed by a vector instruction
  – VLR is loaded prior to executing the vector instruction with a special instruction
§ Stride for load/stores:
  – Vectors may not be adjacent in memory addresses
  – E.g., different dimensions of a matrix
  – Stride can be specified as part of the load/store
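
To make the roles of VLR and stride concrete, here is a minimal C sketch of a strided vector load. The type `vreg_t`, the function `lv_strided`, and the constant `MVL` are hypothetical names, and the stride is counted in elements rather than bytes to keep the example short.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64  /* assumed maximum vector length (VLRMAX) */

typedef struct { double elem[MVL]; } vreg_t;

/* Sketch of LV v, base, stride: load vlr elements that lie `stride`
 * elements apart in memory. Elements at or beyond vlr are left untouched. */
void lv_strided(vreg_t *v, const double *base, ptrdiff_t stride, size_t vlr)
{
    assert(vlr <= MVL);
    for (size_t i = 0; i < vlr; i++)
        v->elem[i] = base[(ptrdiff_t)i * stride];
}
```

For a row-major 4x4 matrix `m`, calling `lv_strided(&v, &m[1], 4, 4)` would load column 1, which is the "different dimensions of a matrix" case mentioned above.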

Vector Code Example
# C code
for (i=0; i<64; i++)
  C[i] = A[i] + B[i];

# Scalar Code
      LI     R4, 64
loop: L.D    F0, 0(R1)
      L.D    F2, 0(R2)
      ADD.D  F4, F2, F0
      S.D    F4, 0(R3)
      DADDIU R1, 8
      DADDIU R2, 8
      DADDIU R3, 8
      DSUBIU R4, 1
      BNEZ   R4, loop

# Vector Code
      LI     VLR, 64
      LV     V1, R1
      LV     V2, R2
      ADDV.D V3, V1, V2
      SV     V3, R3
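
As a rough check on how compact the vector version is, the dynamic instruction counts of the two listings can be compared. The count below assumes the scalar loop body runs 64 times with 9 instructions per iteration plus the initial LI, and that all 64 elements fit in one vector register.

```latex
\begin{align*}
N_{\text{scalar}} &= 1 + 64 \times 9 = 577 \\
N_{\text{vector}} &= 5 \\
\frac{N_{\text{scalar}}}{N_{\text{vector}}} &= \frac{577}{5} \approx 115
\end{align*}
```

This counts instructions fetched and issued, not cycles; the vector instructions still perform 64 element operations each.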

Flynn’s Taxonomy
§ Single instruction, single data (SISD)
  – E.g., our in-order processor
§ Single instruction, multiple data (SIMD)
  – Multiple processing elements, same operation, different data
  – Vector
  – Multiple processing units execute the same instruction on different data in lockstep. Either all complete or none do. Therefore, all units have to execute the same instruction at a given time
§ Multiple instruction, multiple data (MIMD)
  – Multiple autonomous processors executing different instructions on different data
  – Most common and general parallel machine
§ Multiple instruction, single data (MISD)
  – Why would anyone do this?

More Categories
§ Single program, multiple data (SPMD)
  – Multiple autonomous processors execute the program at independent points
  – Difference with SIMD: SIMD imposes a lockstep
  – Programs at SPMD can be at independent points
  – SPMD can run on general purpose processors
  – Most common method for parallel computing
§ Multiple program, multiple data (MPMD)
  – Multiple autonomous processors simultaneously operating at least 2 independent programs

Vector Supercomputers
§ Epitome: Cray-1, 1976
§ Scalar Unit
  – Load/Store Architecture
§ Vector Extension
  – Vector Registers
  – Vector Instructions
§ Implementation
  – Hardwired Control
  – Highly Pipelined Functional Units
  – Interleaved Memory System
  – No Data Caches
  – No Virtual Memory

Cray-1 (1976)
[Figure: Cray-1 block diagram]
– Single port memory: 16 banks of 64-bit words + 8-bit SECDED
– 80 MW/sec data load/store, 320 MW/sec instruction buffer refill
– 4 instruction buffers (64-bit x 16); NIP, LIP, CIP
– Scalar registers S0 to S7, backed by 64 T registers
– Address registers A0 to A7, backed by 64 B registers
– Vector registers V0 to V7 (64 element vector registers), with vector mask and vector length registers
– Functional units: FP Add, FP Mul, FP Recip; Int Add, Int Logic, Int Shift, Pop Cnt; Addr Add, Addr Mul
– Memory bank cycle 50 ns, processor cycle 12.5 ns (80 MHz)

Vector Instruction Set Advantages
§ Compact
  – one short instruction encodes N operations
§ Expressive, tells hardware that these N operations:
  – are independent
  – use the same functional unit
  – access disjoint registers
  – access registers in same pattern as previous instructions
  – access a contiguous block of memory (unit-stride load/store)
  – access memory in a known pattern (strided load/store)
§ Scalable
  – can run same code on more parallel pipelines (lanes)

Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six-stage multiply pipeline computing V3 <- V1 * V2]

Vector Instruction Execution
ADDV C, A, B
[Figure: Execution using one pipelined functional unit: element pairs A[i], B[i] enter the pipeline one after another, and results C[0], C[1], C[2], ... leave it one after another]
[Figure: Execution using four pipelined functional units: the units handle elements 0, 4, 8, ...; 1, 5, 9, ...; 2, 6, 10, ...; and 3, 7, 11, ... respectively]
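
The element-to-unit assignment in the four-unit figure, and the resulting issue time, can be sketched in C. The function names are hypothetical, and the cycle count ignores pipeline startup latency.

```c
#include <assert.h>

/* Lane (functional unit) that processes element i when elements are
 * striped across lanes: 0, 4, 8, ... go to lane 0 when lanes == 4. */
unsigned lane_of(unsigned i, unsigned lanes)
{
    assert(lanes > 0);
    return i % lanes;
}

/* Cycles needed to issue vl element operations, one per lane per cycle. */
unsigned issue_cycles(unsigned vl, unsigned lanes)
{
    assert(lanes > 0);
    return (vl + lanes - 1) / lanes;   /* ceil(vl / lanes) */
}
```

Under this model a 64-element ADDV occupies one pipelined unit for 64 cycles and four units for 16 cycles.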

How Do Vector Architectures Affect Memory?

Interleaved Vector Memory System
[Figure: an address generator combines base and stride to produce addresses for 16 memory banks (0 to F), which feed the vector registers]
Cray-1, 16 banks, 4 cycle bank busy time, 12 cycle latency
• Bank busy time: Time before bank ready to accept next request
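
One way to reason about bank busy time is to ask how often a strided access stream returns to the same bank. The sketch below assumes word-interleaved banks, a stride counted in words, and one memory request per cycle; the function names and this simplified timing model are assumptions, not a description of the Cray-1 memory controller.

```c
#include <assert.h>

static unsigned gcd(unsigned a, unsigned b)
{
    while (b != 0) { unsigned t = a % b; a = b; b = t; }
    return a;
}

/* Number of distinct banks visited by addresses base, base + stride,
 * base + 2*stride, ... when bank = address mod nbanks. */
unsigned banks_touched(unsigned stride, unsigned nbanks)
{
    assert(nbanks > 0);
    return nbanks / gcd(stride, nbanks);
}

/* Nonzero if one request per cycle never finds its bank still busy:
 * a bank is revisited every banks_touched() cycles. */
int stall_free(unsigned stride, unsigned nbanks, unsigned busy_cycles)
{
    return banks_touched(stride, nbanks) >= busy_cycles;
}
```

With the Cray-1 numbers quoted above (16 banks, 4 cycle busy time), unit stride spreads over all 16 banks, while a stride of 8 keeps returning to only 2 banks and would stall under this model.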

Vector Unit Structure
[Figure: four lanes, each with functional unit pipelines and a slice of the vector registers, connected to the memory subsystem; the lanes hold elements 0, 4, 8, ...; elements 1, 5, 9, ...; elements 2, 6, 10, ...; and elements 3, 7, 11, ...]

T0 Vector Microprocessor (UCB/ICSI, 1995)
Vector register elements striped over lanes
[Figure: eight lanes; the lanes hold elements [0] [8] [16] [24], [1] [9] [17] [25], [2] [10] [18] [26], ..., [7] [15] [23] [31]]

Vector Instruction Parallelism
§ Can overlap execution of multiple vector instructions
  – example machine has 32 elements per vector register and 8 lanes
[Figure: instruction issue over time on the Load Unit, Multiply Unit, and Add Unit; successive load, mul, and add instructions overlap]
Complete 24 operations/cycle while issuing 1 short instruction/cycle

Vector Chaining
§ Vector version of register bypassing
  – introduced with Cray-1
      LV   v1
      MULV v3, v1, v2
      ADDV v5, v3, v4
[Figure: the load unit fills V1 from memory; V1 is chained into the multiplier together with V2 to produce V3; V3 is chained into the adder together with V4 to produce V5]

Vector Chaining Advantage
• Without chaining, must wait for last element of result to be written before starting dependent instruction
  [Figure: Load, Mul, and Add run one after another along the time axis]
• With chaining, can start dependent instruction as soon as first result appears
  [Figure: Load, Mul, and Add overlap along the time axis]
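
A back-of-the-envelope timing model shows the size of the gap between the two cases. It assumes a single lane, a chain of dependent vector instructions that each have a fixed startup latency and then produce one element per cycle, and it ignores dead time; the function names and the latency values used in the note below are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Without chaining: each instruction waits for the previous one to write
 * its last element, so every instruction costs latency + vl cycles. */
unsigned cycles_unchained(const unsigned *latency, size_t n, unsigned vl)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += latency[i] + vl;
    return total;
}

/* With chaining: each instruction starts as soon as the first result of its
 * predecessor appears, so only the latencies add up before the final
 * instruction streams out its vl results. */
unsigned cycles_chained(const unsigned *latency, size_t n, unsigned vl)
{
    unsigned total = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        total += latency[i];
    return total + vl;
}
```

For a load, multiply, add chain with assumed latencies of 12, 6, and 4 cycles and 64-element vectors, this model gives 214 cycles without chaining and 86 cycles with it.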

Vector Startup
§ Two components of vector startup penalty
  – functional unit latency (time through pipeline)
  – dead time or recovery time (time before another vector instruction can start down pipeline). Some pipelines reduce control logic by requiring dead time between instructions to the same vector unit
[Figure: pipeline diagram (stages R X X X W) for the elements of a first and a second vector instruction, with the functional unit latency and the dead time between the two instructions marked]

Dead Time and Short Vectors
Cray C90, two lanes, 4 cycle dead time
  – Maximum efficiency 94% with 128 element vectors (64 cycles active, 4 cycles dead time)
T0, eight lanes, no dead time
  – 100% efficiency with 8 element vectors
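
The efficiency figures on this slide follow from dividing active cycles by active plus dead cycles. A small C sketch of that calculation, with a hypothetical function name:

```c
#include <assert.h>

/* Fraction of cycles a vector unit does useful work when each vector
 * instruction of length vl is followed by dead_cycles idle cycles. */
double vector_efficiency(unsigned vl, unsigned lanes, unsigned dead_cycles)
{
    assert(lanes > 0 && vl > 0);
    unsigned active = (vl + lanes - 1) / lanes;   /* cycles doing work */
    return (double)active / (double)(active + dead_cycles);
}
```

For the Cray C90 case, 128 elements over two lanes give 64 active cycles, and 64 / (64 + 4) is about 0.94. Shorter vectors lower the ratio further, which is why dead time matters most for short vectors.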

Vector Memory-Memory versus Vector Register Machines
§ Vector memory-memory instructions hold all vector operands in main memory
§ The first vector machines, CDC Star-100 ('73) and TI ASC ('71), were memory-memory machines
§ Cray-1 ('76) was first vector register machine

Example Source Code
  for (i=0; i<N; i++) {
    C[i] = A[i] + B[i];
    D[i] = A[i] - B[i];
  }

Vector Memory-Memory Code
  ADDV C, A, B
  SUBV D, A, B

Vector Register Code
  LV   V1, A
  LV   V2, B
  ADDV V3, V1, V2
  SV   V3, C
  SUBV V4, V1, V2
  SV   V4, D

Vector Memory-Memory vs. Vector Register Machines
§ Vector memory-memory architectures (VMMA) require greater main memory bandwidth, why?
  – All operands must be read in and out of memory
§ VMMAs make it difficult to overlap execution of multiple vector operations, why?
  – Must check dependencies on memory addresses
§ VMMAs incur greater startup latency
  – Scalar code was faster on CDC Star-100 (VMM) for vectors < 100 elements
§ Apart from CDC follow-ons (Cyber-205, ETA-10) all major vector machines since Cray-1 have had vector register architectures
§ (we ignore vector memory-memory from now on)

Automatic Code Vectorization
  for (i=0; i < N; i++)
    C[i] = A[i] + B[i];
[Figure: Scalar Sequential Code: load, load, add, store for Iter. 1, followed in time by load, load, add, store for Iter. 2]
[Figure: Vectorized Code: each vector instruction (load, load, add, store) covers Iter. 1 and Iter. 2 together]
Vectorization is a massive compile-time reordering of operation sequencing
⇒ requires extensive loop dependence analysis

Vector Stripmining
Problem: Vector registers have finite length
Solution: Break loops into pieces that fit in registers, “Stripmining”
  for (i=0; i<N; i++)
    C[i] = A[i]+B[i];
[Figure: arrays A, B, and C split into a remainder strip followed by strips of 64 elements, each strip added separately]
      ANDI   R1, N, 63     # N mod 64
      MTC1   VLR, R1       # Do remainder
loop: LV     V1, RA
      DSLL   R2, R1, 3     # Multiply by 8
      DADDU  RA, RA, R2    # Bump pointer
      LV     V2, RB
      DADDU  RB, RB, R2
      ADDV.D V3, V1, V2
      SV     V3, RC
      DADDU  RC, RC, R2
      DSUBU  N, N, R1      # Subtract elements
      LI     R1, 64
      MTC1   VLR, R1       # Reset full length
      BGTZ   N, loop       # Any more to do?
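
The same stripmining structure can be written in C, with an inner helper standing in for the LV/LV/ADDV.D/SV sequence. The names `vadd_strip`, `vadd_stripmined`, and `MVL` are hypothetical; the remainder strip is handled first, as in the assembly above.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64   /* maximum vector length, matching the 64-element strips */

/* Stand-in for LV / LV / ADDV.D / SV on one strip of vl <= MVL elements. */
static void vadd_strip(double *c, const double *a, const double *b, size_t vl)
{
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)
        c[i] = a[i] + b[i];
}

/* Adds n elements strip by strip and returns the number of strips issued. */
size_t vadd_stripmined(double *c, const double *a, const double *b, size_t n)
{
    size_t strips = 0;
    size_t vl = n % MVL;              /* remainder first (ANDI R1, N, 63) */
    while (n > 0) {                   /* BGTZ N, loop */
        vadd_strip(c, a, b, vl);
        a += vl; b += vl; c += vl;    /* bump pointers */
        n -= vl;                      /* subtract elements done */
        vl = MVL;                     /* reset to full length */
        strips++;
    }
    return strips;
}
```

When `n` is a multiple of 64 the first strip has length zero and does no work, which mirrors the assembly setting VLR to 0 on its first pass.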

Vector Conditional Execution
Problem: Want to vectorize loops with conditional code:
  for (i=0; i<N; i++)
    if (A[i]>0) then
      A[i] = B[i];
Solution: Add vector mask (or flag) registers
  – vector version of predicate registers, 1 bit per element
…and maskable vector instructions
  – vector operation becomes bubble (“NOP”) at elements where mask bit is clear
Code example:
  CVM               # Turn on all elements
  LV vA, rA         # Load entire A vector
  SGTVS.D vA, F0    # Set bits in mask register where A>0
  LV vA, rB         # Load B vector into A under mask
  SV vA, rA         # Store A back to memory under mask
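
The masked sequence can be imitated in C by first computing a mask vector and then applying the copy only where the mask bit is set. This is a sketch for one strip of at most `MVL` elements; `masked_copy` and `MVL` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

/* Masked form of: for (i = 0; i < vl; i++) if (A[i] > 0) A[i] = B[i]; */
void masked_copy(double *a, const double *b, size_t vl)
{
    bool mask[MVL];
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++)   /* SGTVS.D: mask[i] = (A[i] > 0) */
        mask[i] = a[i] > 0.0;
    for (size_t i = 0; i < vl; i++)   /* LV and SV under mask */
        if (mask[i])
            a[i] = b[i];
}
```

Splitting the work into a mask-producing pass and a mask-consuming pass is what removes the branch from the loop body; each pass is a straight-line vector operation.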

Masked Vector Instructions
Simple Implementation
  – execute all N operations, turn off result writeback according to mask
  [Figure: every element pair A[i], B[i] flows through the pipeline; mask bits M[0]=0, M[1]=1, M[2]=0, M[3]=0, M[4]=1, M[5]=1, M[6]=0, M[7]=1 drive the write enable of the write data port]
Density-Time Implementation
  – scan mask vector and only execute elements with non-zero masks
  [Figure: with the same mask bits, only the elements whose mask bit is 1 enter the pipeline, and results C[1], C[4], C[5] reach the write data port]
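
The difference between the two implementations shows up in how many pipeline slots a masked instruction occupies. A minimal C sketch, assuming a single lane and one element per cycle (function names are hypothetical):

```c
#include <stddef.h>

/* Simple implementation: every element occupies a pipeline slot;
 * the mask only gates the result writeback. */
size_t cycles_simple(const unsigned char *mask, size_t n)
{
    (void)mask;
    return n;
}

/* Density-time implementation: only elements whose mask bit is set
 * are issued to the pipeline. */
size_t cycles_density_time(const unsigned char *mask, size_t n)
{
    size_t issued = 0;
    for (size_t i = 0; i < n; i++)
        if (mask[i])
            issued++;
    return issued;
}
```

For the mask shown in the figure (M[0..7] = 0, 1, 0, 0, 1, 1, 0, 1) the simple scheme uses 8 slots and the density-time scheme uses 4, at the cost of extra logic to scan the mask.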

Vector Reductions
Problem: Loop-carried dependence on reduction variables
  sum = 0;
  for (i=0; i<N; i++)
    sum += A[i];                     # Loop-carried dependence on sum
Solution: Re-associate operations if possible, use binary tree to perform reduction
  # Rearrange as:
  sum[0:VL-1] = 0                    # Vector of VL partial sums
  for(i=0; i<N; i+=VL)               # Stripmine VL-sized chunks
    sum[0:VL-1] += A[i:i+VL-1];      # Vector sum
  # Now have VL partial sums in one vector register
  do {
    VL = VL/2;                       # Halve vector length
    sum[0:VL-1] += sum[VL:2*VL-1]    # Halve no. of partials
  } while (VL>1)
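
The rearranged reduction can be sketched as ordinary C. The example assumes a power-of-two vector length (`VLMAX`, a hypothetical constant kept small for readability) and handles a final partial strip by bounding the inner loop.

```c
#include <stddef.h>

#define VLMAX 8   /* assumed power-of-two vector length */

double vector_sum(const double *a, size_t n)
{
    double partial[VLMAX] = {0};      /* vector of VLMAX partial sums */

    /* Stripmined accumulation: no dependence between the VLMAX lanes. */
    for (size_t i = 0; i < n; i += VLMAX)
        for (size_t j = 0; j < VLMAX && i + j < n; j++)
            partial[j] += a[i + j];

    /* Binary tree: halve the number of partial sums each step. */
    for (size_t vl = VLMAX / 2; vl >= 1; vl /= 2)
        for (size_t j = 0; j < vl; j++)
            partial[j] += partial[j + vl];

    return partial[0];
}
```

Because floating-point addition is not associative, this ordering can round differently from the sequential loop, which is the reason for the "if possible" qualifier on the slide.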

Vector Scatter/Gather
Want to vectorize loops with indirect accesses:
  for (i=0; i<N; i++)
    A[i] = B[i] + C[D[i]]
Indexed load instruction (Gather)
  LV vD, rD           # Load indices in D vector
  LVI vC, rC, vD      # Load indirect from rC base
  LV vB, rB           # Load B vector
  ADDV.D vA,vB,vC     # Do add
  SV vA, rA           # Store result
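
In C terms, the indexed load gathers scattered elements into a dense temporary, after which the add is an ordinary element-wise operation. The helper names `lvi` and `indexed_add` and the constant `MVL` are hypothetical, and the sketch covers one strip of at most `MVL` elements.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

/* Sketch of LVI: gather base[idx[i]] for i in [0, vl) into v. */
static void lvi(double *v, const double *base, const size_t *idx, size_t vl)
{
    for (size_t i = 0; i < vl; i++)
        v[i] = base[idx[i]];
}

/* A[i] = B[i] + C[D[i]] for one strip. */
void indexed_add(double *a, const double *b, const double *c,
                 const size_t *d, size_t vl)
{
    double vc[MVL];
    assert(vl <= MVL);
    lvi(vc, c, d, vl);                 /* LV vD ; LVI vC, rC, vD */
    for (size_t i = 0; i < vl; i++)    /* LV vB ; ADDV.D ; SV vA */
        a[i] = b[i] + vc[i];
}
```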

Vector Scatter/Gather
Histogram example:
  for (i=0; i<N; i++)
    A[B[i]]++;
Is following a correct translation?
  LV vB, rB           # Load indices in B vector
  LVI vA, rA, vB      # Gather initial A values
  ADDV vA, vA, 1      # Increment
  SVI vA, rA, vB      # Scatter incremented values
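
One way to explore the question is to run both versions on an index vector that contains duplicates. The C sketch below renders the gather, increment, scatter sequence literally for one strip; the function names and `MVL` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MVL 64   /* assumed maximum vector length */

/* Scalar reference: every occurrence of an index is counted. */
void histogram_scalar(int *a, const size_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[b[i]]++;
}

/* Literal rendering of LV / LVI / ADDV / SVI for one strip of vl elements. */
void histogram_gather_scatter(int *a, const size_t *b, size_t vl)
{
    int va[MVL];
    assert(vl <= MVL);
    for (size_t i = 0; i < vl; i++) va[i] = a[b[i]];   /* LVI: gather  */
    for (size_t i = 0; i < vl; i++) va[i] += 1;        /* ADDV         */
    for (size_t i = 0; i < vl; i++) a[b[i]] = va[i];   /* SVI: scatter */
}
```

With indices {2, 2, 2, 0} the scalar loop leaves bin 2 at 3, while the gather and scatter version leaves it at 1: all three copies read the old value before any of them is written back. The translation is therefore only safe when the indices within a strip are distinct.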

A Modern Vector Super: NEC SX-9 (2008)
§ 65nm CMOS technology
§ Vector unit (3.2 GHz)
  – 8 foreground VRegs + 64 background VRegs (256 x 64-bit elements/VReg)
  – 64-bit functional units: 2 multiply, 2 add, 1 divide/sqrt, 1 logical, 1 mask unit
  – 8 lanes (32+ FLOPS/cycle, 100+ GFLOPS peak per CPU)
  – 1 load or store unit (8 x 8-byte accesses/cycle)
§ Scalar unit (1.6 GHz)
  – 4-way superscalar with out-of-order and speculative execution
  – 64KB I-cache and 64KB data cache
• Memory system provides 256 GB/s DRAM bandwidth per CPU
• Up to 16 CPUs and up to 1 TB DRAM form shared-memory node
  – total of 4 TB/s bandwidth to shared DRAM memory
• Up to 512 nodes connected via 128 GB/s network links (message passing between nodes)

Multimedia Extensions (aka SIMD extensions)
§ Very short vectors added to existing ISAs for microprocessors
§ Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
  – Lincoln Labs TX-2 from 1957 had 36b datapath split into 2x18b or 4x9b
  – Newer designs have wider registers
    • 128b for PowerPC Altivec, Intel SSE2/3/4
    • 256b for Intel AVX
§ Single instruction operates on all elements within register
[Figure: a 64b register viewed as 1x64b, 2x32b, 4x16b, or 8x8b; a 4x16b add performs four independent 16b additions]
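
The 4x16b add can be imitated in portable C on a 64-bit integer, which shows what the hardware has to do: keep carries from crossing lane boundaries. This is a software sketch of the idea, not how any particular SIMD extension is implemented, and `add4x16` is a hypothetical name.

```c
#include <stdint.h>

/* Add four packed 16-bit lanes held in one 64-bit word.
 * Each lane wraps around independently; no carry crosses a lane boundary. */
uint64_t add4x16(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8000800080008000ULL;   /* top bit of each lane */
    uint64_t low = (x & ~H) + (y & ~H);         /* add the low 15 bits per lane */
    return low ^ ((x ^ y) & H);                 /* add top bits, discarding carry out */
}
```

A plain 64-bit add of the same operands would let a carry out of one lane corrupt its neighbor; a multimedia extension provides the carry-cutting behavior in a single instruction.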

Multimedia Extensions versus Vectors
§ Limited instruction set:
  – no vector length control
  – no strided load/store or scatter/gather
  – unit-stride loads must be aligned to 64/128-bit boundary
§ Limited vector register length:
  – requires superscalar dispatch to keep multiply/add/load units busy
  – loop unrolling to hide latencies increases register pressure
§ Trend towards fuller vector support in microprocessors
  – Better support for misaligned memory accesses
  – Support of double-precision (64-bit floating-point)
  – New Intel AVX spec (announced April 2008), 256b vector registers (expandable up to 1024b)

Degree of Vectorization
§ Compilers are good at finding data-level parallelism

Average Vector Length
§ Maximum depends on whether benchmarks use 16-bit or 32-bit operations

Distribution of Instructions

Question of the Day
§ Can Vector and VLIW combine?
§ Yes!
§ Fujitsu FR-V can process both VLIW and vector instructions
§ Exploits both instruction- and data-level parallelism

Acknowledgements
§ These slides contain material developed and copyright by:
  – Arvind (MIT)
  – Krste Asanovic (MIT/UCB)
  – Joel Emer (Intel/MIT)
  – James Hoe (CMU)
  – John Kubiatowicz (UCB)
  – David Patterson (UCB)
§ MIT material derived from course 6.823
§ UCB material derived from course CS252
§ “Vector Vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks”. Christos Kozyrakis and David Patterson. MICRO-35. 2002