Lecture 02: Parallel Architecture
TRANSCRIPT
Lecture 02: Parallel Architecture
ILP: Instruction Level Parallelism, TLP: Thread Level Parallelism, and DLP: Data Level Parallelism
CSCE790: Parallel Programming Models for Multicore and Manycore Processors
Department of Computer Science and Engineering
http://cse.sc.edu/~yanyh
Flynn’s Taxonomy of Parallel Architectures
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
SISD: Single Instruction Single Data
• At one time, one instruction operates on one data item
• Based on the traditional von Neumann uniprocessor architecture
– Instructions are executed sequentially, or serially, one step after the next
• Until recently, most computers were of the SISD type.
SIMD: Single Instruction Multiple Data
• Also known as array processors from early on
• A single instruction stream is broadcast to multiple processors, each having its own data stream
– Still used in some graphics cards today
[Figure: a control unit broadcasts one instruction stream to four processors, each operating on its own data.]
MIMD: Multiple Instructions Multiple Data
• Each processor has its own instruction stream and input data
• Very general case
– Every other scenario can be mapped to MIMD
• Further breakdown of MIMD is usually based on the memory organization
– Shared memory systems
– Distributed memory systems
Parallelism in Hardware Architecture
• SISD: inherently sequential
– Instruction Level Parallelism: overlapping execution of instructions through pipelining, since an instruction's execution can be split into multiple stages
– Out-of-order execution
– Speculation
– Superscalar
• SIMD: inherently parallel, with constraints
– Data Level Parallelism: one instruction stream, multiple data
• MIMD: inherently parallel
– Thread Level Parallelism: multiple instruction streams executing independently
Abstraction: Levels of Representation/Interpretation

High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
↓ Compiler
Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)
↓ Assembler
Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
↓ Architecture Implementation
Logic Circuit Description (circuit schematic diagrams)

Anything can be represented as a number, i.e., data or instructions.
Instruction Level Parallelism
• Instruction execution can be divided into multiple stages (5 stages in RISC):
– Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4 to it.
– Instruction decode/register fetch cycle (ID): decode the instruction and read the registers corresponding to the register source specifiers from the register file.
– Execution/effective address cycle (EX): perform the memory address calculation for load/store instructions, or the ALU operation for register-register and register-immediate ALU instructions.
– Memory access (MEM): perform the memory access for load/store instructions.
– Write-back cycle (WB): write results back to the destination operands for register-register ALU or load instructions.
Pipelined Instruction Execution
[Figure: four instructions, in instruction order, shown over clock cycles 1–7; each instruction flows through Ifetch, Reg, ALU, DMem, and Reg (write-back), offset by one cycle from the previous instruction.]
Pipelining: It's Natural!
• Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
– Washer takes 30 minutes
– Dryer takes 40 minutes
– “Folder” takes 20 minutes
• One load: 90 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight; loads A–D run back-to-back in task order, each taking 30 + 40 + 20 minutes.]
Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads (sequential laundry takes 6 hours)
[Figure: timeline from 6 PM to midnight; each load's wash overlaps the previous load's dry, so the four loads finish in 30 + 4×40 + 20 = 210 minutes.]
Classic 5-Stage Pipeline for a RISC
• Each cycle the hardware initiates a new instruction and is executing some part of five different instructions.
– One cycle per instruction vs. 5 cycles per instruction

| Instruction \ Clock | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Instruction i | IF | ID | EX | MEM | WB | | | | |
| Instruction i+1 | | IF | ID | EX | MEM | WB | | | |
| Instruction i+2 | | | IF | ID | EX | MEM | WB | | |
| Instruction i+3 | | | | IF | ID | EX | MEM | WB | |
| Instruction i+4 | | | | | IF | ID | EX | MEM | WB |
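The timing in the table can be checked with a small calculation: with no stalls, a k-stage pipeline finishes n instructions in k + (n − 1) cycles, while an unpipelined machine needs n×k. A minimal sketch in C (the function names are illustrative, not from the lecture):

```c
/* Cycles to run n instructions on a k-stage pipeline with no stalls:
 * the first instruction takes k cycles to drain through all stages,
 * then one instruction completes every cycle after that. */
unsigned pipelined_cycles(unsigned n, unsigned k) {
    return n == 0 ? 0 : k + (n - 1);
}

/* Unpipelined: every instruction occupies all k stages serially. */
unsigned sequential_cycles(unsigned n, unsigned k) {
    return n * k;
}
```

For the 5 instructions in the table, pipelined_cycles(5, 5) gives 9 cycles, matching clock columns 1 through 9, while the unpipelined machine would need 25.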
Pipeline and Superscalar
Advanced ILP
• Dynamic Scheduling → Out-of-order Execution
• Speculation → In-order Commit
• Superscalar → Multiple Issue

| Technique | Goal | Implementation | Addressing | Approach |
|---|---|---|---|---|
| Dynamic scheduling | Out-of-order execution | Reservation stations, load/store buffers, and the CDB | Data hazards (RAW, WAW, WAR) | Register renaming |
| Speculation | In-order commit | Branch prediction (BHT/BTB) and the reorder buffer | Control hazards (branch, function, exception) | Prediction and misprediction recovery |
| Superscalar/VLIW | Multiple issue | Software and hardware | To increase IPC (CPI below 1) | By compiler or hardware |
Problems of traditional ILP scaling
• Fundamental circuit limitations [1]
– Delays grow as issue queues and multi-port register files grow
– Increasing delays limit the performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
– Inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies

[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
ILP impacts
Simulations of 8-issue Superscalar
Power/heat density limits frequency
• Some fundamental physical limits are being reached
We will have this…
Revolution is happening now
• Chip density is continuing to increase ~2x every 2 years
– Clock speed is not
– The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
– No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Current Trends in Architecture
• Cannot continue to leverage Instruction-Level Parallelism (ILP)
– Single-processor performance improvement ended in 2003
• Recent models for performance:
– Exploit Data-Level Parallelism (DLP) via SIMD architectures and GPUs
– Exploit Thread-Level Parallelism (TLP) via MIMD
– Others
SIMD: Single Instruction, Multiple Data (Data Level Parallelism)
• SIMD architectures can exploit significant data-level parallelism for:
– matrix-oriented scientific computing
– media-oriented image and sound processing
• SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data operation processing multiple data elements
– Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially
[Figure: a control unit broadcasts one instruction stream to four processors, each operating on its own data.]
SIMD Parallelism
• Three variations
– Vector architectures (early age)
– SIMD extensions
– Graphics Processing Units (GPUs) (dedicated weeks for GPUs)
• For x86 processors:
– Expect two additional cores per chip per year (MIMD)
– SIMD width to double every four years
– Potential speedup from SIMD to be twice that from MIMD!
Vector Architectures
• Vector processors abstract operations on vectors, e.g., replacing the loop

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }

with

    a = b + c;    (one vector instruction: ADDV.D V10, V8, V6)

• Some languages offer high-level support for these operations (e.g., Fortran 90 or newer)
Vector Programming Model
[Figure: the register state comprises scalar registers r0–r15, vector registers v0–v15 (each holding elements [0] through [VLRMAX-1]), and a Vector Length Register (VLR).
Vector arithmetic instructions, e.g. ADDV v3, v1, v2, add elements [0] through [VLR-1] of v1 and v2 into v3.
Vector load and store instructions, e.g. LV v1, (r1, r2), move a vector between memory and a vector register, with the base address in r1 and the stride in r2.]
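The VLR is what lets a fixed set of vector instructions handle arrays of arbitrary length: the loop is "strip-mined" into chunks of at most VLRMAX elements, with VLR set to the chunk size on each pass. A minimal C sketch of the idea (the VLRMAX value and function name are illustrative assumptions, not lecture code):

```c
#define VLRMAX 64  /* maximum vector length, e.g. 64 elements */

/* Strip-mined vector add: each outer iteration models setting VLR and
 * issuing one LV/LV/ADDV/SV sequence over up to VLRMAX elements. */
void vec_add(double *a, const double *b, const double *c, int n) {
    for (int i = 0; i < n; i += VLRMAX) {
        int vlr = (n - i < VLRMAX) ? (n - i) : VLRMAX;  /* set VLR */
        for (int j = 0; j < vlr; j++)  /* one ADDV over VLR elements */
            a[i + j] = b[i + j] + c[i + j];
    }
}
```

The inner loop stands in for a single vector instruction; a real vector machine executes it in hardware, pipelined across the vector elements.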
Vector Was Supercomputers
• Epitome: Cray-1, 1976
• Scalar unit
– Load/store architecture
• Vector extension
– Vector registers
– Vector instructions
• Implementation
– Hardwired control
– Highly pipelined functional units
– Interleaved memory system
– No data caches
– No virtual memory
AXPY (64 elements) (Y = a*X + Y) in MIPS and VMIPS
• Number of instructions: 6 vs. ~600
• Pipeline stalls: 64× higher for MIPS
• Vector chaining (forwarding): V1, V2, V3 and V4

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];

The starting addresses of X and Y are in Rx and Ry, respectively.
SIMD Instructions
• Originally developed for multimedia applications
• The same operation is executed on multiple data items
• Uses a fixed-length register and partitions the carry chain so the same functional unit can serve multiple operations
– E.g., a 64-bit adder can be used for two 32-bit add operations simultaneously
SIMD Instructions
• MMX (Multi-Media Extension), 1996
– The existing 64-bit floating-point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension), 1999
– Successor to the MMX instructions
– Separate 128-bit registers added, for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Added support for double-precision operations
• AVX (Advanced Vector Extensions), 2010
– 256-bit registers added
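The register-width progression above is what C SIMD programming targets. A hedged sketch using the GCC/Clang vector_size extension (this is not lecture code, and real SIMD kernels more often use intrinsics such as SSE's _mm_add_ps; the type name here is an illustrative assumption):

```c
/* Four 32-bit floats packed into one 128-bit value,
 * the same shape as an SSE register. */
typedef float f32x4 __attribute__((vector_size(16)));

/* One "instruction" adds all four lanes at once; the compiler can
 * lower this to a single SIMD add where the hardware supports it. */
f32x4 add4(f32x4 x, f32x4 y) {
    return x + y;
}
```

With AVX the same pattern widens to 256 bits (eight floats or four doubles per operation).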
AXPY
• With 256-bit SIMD extensions (4 double-precision FP lanes):
– MIPS: 578 instructions
– SIMD MIPS: 149 instructions (about a 4× reduction)
– VMIPS: 6 instructions (about a 100× reduction)

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
State of the Art: Intel Xeon Phi Manycore Vector Capability
• Intel Xeon Phi Knights Corner, 2012, ~60 cores, 4-way SMT
• Intel Xeon Phi Knights Landing, 2016, ~60 cores, 4-way SMT and HBM
– http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf
http://primeurmagazine.com/repository/PrimeurMagazine-AE-PR-12-14-32.pdf
State of the Art: ARM Scalable Vector Extension (SVE)
• Announced in August 2016
– https://community.arm.com/groups/processors/blog/2016/08/22/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
– http://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.131-ARMv8-vector-Stephens-Yoshida-ARM-v8-23_51-v11.pdf
• Beyond the vector architecture we learned:
– Vector loops, predication and speculation
– Vector Length Agnostic (VLA) programming
– Check the slides
Limitations of optimizing a single instruction stream
• Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously, due to
– data dependencies
– limitations of speculative execution across multiple branches
– difficulties in detecting memory dependencies among instructions (alias analysis)
• Consequence: a significant number of functional units are idle at any given time
• Question: can we instead execute instructions from another instruction stream?
– Another thread?
– Another process?
Thread-level parallelism
• Problems with executing instructions from multiple threads at the same time:
– The instructions in each thread might use the same register names
– Each thread has its own program counter
• Virtual memory management allows multiple threads to execute and share the main memory
• When to switch between different threads:
– Fine-grain multithreading: switches between threads on every instruction
– Coarse-grain multithreading: switches only on costly stalls (e.g., level-2 cache misses)
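The two switching policies can be contrasted with a toy scheduling rule: fine-grain rotates to the next thread every cycle, while coarse-grain stays on the current thread until it stalls. A minimal sketch (the function names and the stall flag are illustrative assumptions, not from the lecture):

```c
/* Fine-grain multithreading: issue from a different thread each cycle,
 * rotating round-robin over the hardware thread contexts. */
int fine_grain_next(int current_thread, int num_threads) {
    return (current_thread + 1) % num_threads;
}

/* Coarse-grain multithreading: keep issuing from the current thread,
 * switching only when it stalls on a costly event (e.g., an L2 miss). */
int coarse_grain_next(int current_thread, int num_threads, int stalled) {
    return stalled ? (current_thread + 1) % num_threads : current_thread;
}
```

Fine-grain hides short stalls at the cost of slowing every individual thread; coarse-grain keeps single-thread throughput but leaves short stalls unhidden.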
Convert Thread-level parallelism to instruction-level parallelism
[Figure: issue slots over processor cycles under superscalar, fine-grained, coarse-grained, and simultaneous multithreading execution; shading distinguishes threads 1–5 and idle slots.]
ILP to Do TLP: e.g., Simultaneous Multi-Threading (SMT)
• Works well if
– the number of compute-intensive threads does not exceed the number of threads supported by SMT
– the threads have highly different characteristics (e.g., one thread doing mostly integer operations, another mainly doing floating-point operations)
• Does not work well if
– threads try to utilize the same functional units
• e.g., a dual-processor system, each processor supporting 2 threads simultaneously (the OS thinks there are 4 processors)
• 2 compute-intensive application processes might end up on the same processor instead of different processors (the OS does not see the difference between SMT and real processors!)
Power, Frequency and ILP
CPU frequency increase flattened around 2000–2005, for two main reasons:
1. Limited ILP
2. Power consumption and heat dissipation
Note: even Moore's Law is ending around 2021:
http://spectrum.ieee.org/semiconductors/devices/transistors-could-stop-shrinking-in-2021
https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
http://www.forbes.com/sites/timworstall/2016/07/26/economics-is-important-the-end-of-moores-law
History – Past (2000) and Today
Flynn’s Taxonomy
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
[Figure: the taxonomy chart annotated with check marks and a cross indicating which categories are in current use.]
Examples of MIMD Machines
• Symmetric Shared-Memory Multiprocessor (SMP)
– Multiple processors in a box with shared-memory communication
– Current multicore chips are like this
– Every processor runs a copy of the OS
• Distributed/Non-uniform Shared-Memory Multiprocessor
– Multiple processors, each with local memory, on a general scalable network
– An extremely light "OS" on each node provides simple services
• Scheduling/synchronization
– A network-accessible host for I/O
• Cluster
– Many independent machines connected with a general network
– Communication through messages
[Figure: an SMP with processors P on a shared bus to memory; a grid of processor/memory (P/M) nodes on a network, with a host node.]
Symmetric (Shared-Memory) Multiprocessors (SMP)
• Small numbers of cores
– Typically eight or fewer, and no more than 32 in most cases
• Share a single centralized memory that all processors have equal access to
– Hence the term symmetric
• All existing multicores are SMPs.
• Also called uniform memory access (UMA) multiprocessors
– All processors have a uniform latency
Centralized Shared Memory System (I)
• Multi-core processors
– Typically connected over a cache
– Previous SMP systems were typically connected over the main memory
• Intel X7350 quad-core (Tigerton)
– Private L1 cache: 32 KB instruction, 32 KB data
– Shared L2 cache: 4 MB unified cache
[Figure: two pairs of cores, each core with a private L1 and each pair sharing an L2, connected to a 1066 MHz FSB.]
Centralized Shared Memory System (SMP) (II)
• Intel X7350 quad-core (Tigerton) multi-processor configuration
[Figure: four sockets (0–3) holding cores C0–C15, each pair of cores sharing an L2; each socket connects at 8 GB/s to a Memory Controller Hub (MCH), which connects to four memory channels.]
Distributed Shared-Memory Multiprocessor
• Large processor count
– 64 to 1000s
• Distributed memory
– Remote vs. local memory
– Long vs. short latency
– High vs. low latency
• Interconnection network
– Bandwidth, topology, etc.
• Nonuniform memory access (NUMA)
• Each processor may have local I/O
Distributed Shared-Memory Multiprocessor (NUMA)
• Reduces the memory bottleneck compared to SMPs
• More difficult to program efficiently
– E.g., first-touch policy: a data item is placed in the memory of the processor that uses it first
• To reduce the effects of non-uniform memory access, caches are often used
– ccNUMA: cache-coherent non-uniform memory access architectures
• Largest example as of today: SGI Origin with 512 processors
Shared-Memory Multiprocessor
• SMP and DSM are both shared-memory multiprocessors
– UMA or NUMA
• Multicores are SMP shared memory
• Most multi-CPU machines are DSM
– NUMA
• Shared address space (virtual address space)
– Not always shared memory
Current Trends in Computer Architecture
• Cannot continue to leverage ILP
– Single-processor performance improvement ended in 2003
• Current models for performance:
– Exploit Data-Level Parallelism (DLP) via SIMD architectures (vector, SIMD extensions and GPUs)
– Exploit Thread-Level Parallelism (TLP) via MIMD
– Heterogeneity: integrate multiple and different architectures together at the chip/system level
• Emerging architectures
– Domain-specific architectures: deep learning processing units (e.g., TPU)
– E.g., Machine Learning Pulls Processor Architectures onto New Path
• https://www.top500.org/news/machine-learning-pulls-processor-architectures-onto-new-path/
These require explicit restructuring of the application ← Parallel Programming
The “Future” of Moore’s Law
• The chips are down for Moore’s law
– http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
• Special Report: 50 Years of Moore’s Law
– http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
• Moore’s law really is dead this time
– http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
• Rebooting the IT Revolution: A Call to Action (SIA/SRC, 2015)
– https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf