Lecture 02: Parallel Architecture
TRANSCRIPT
Lecture 02: Parallel Architecture
ILP: Instruction Level Parallelism, TLP: Thread Level Parallelism, and DLP: Data Level Parallelism
CSCE790: Parallel Programming Models for Multicore and Manycore Processors
Department of Computer Science and Engineering
http://cse.sc.edu/~yanyh
Flynn’s Taxonomy of Parallel Architectures
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
SISD: Single Instruction Single Data
• At one time, one instruction operates on one data item
• Based on the traditional von Neumann uniprocessor architecture
– Instructions are executed sequentially, or serially, one step after the next
• Until recently, most computers were of the SISD type.
SIMD: Single Instruction Multiple Data
• Also known as array processors from early on
• A single instruction stream is broadcast to multiple processors, each having its own data stream
– Still used in some graphics cards today
[Figure: a control unit broadcasts one instruction stream to four processors, each operating on its own data.]
MIMD: Multiple Instructions Multiple Data
• Each processor has its own instruction stream and input data
• Very general case
– Every other scenario can be mapped to MIMD
• Further breakdown of MIMD is usually based on the memory organization
– Shared memory systems
– Distributed memory systems
Parallelism in Hardware Architecture
• SISD: inherently sequential
– Instruction Level Parallelism: overlapping execution of instructions through pipelining, since an instruction's execution can be split into multiple stages
– Out-of-order execution
– Speculation
– Superscalar
• SIMD: inherently parallel, with constraints
– Data Level Parallelism: one instruction stream, multiple data
• MIMD: inherently parallel
– Thread Level Parallelism: multiple instruction streams executing independently
Abstraction: Levels of Representation/Interpretation

High Level Language Program (e.g., C):
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
↓ Compiler
Assembly Language Program (e.g., MIPS):
    lw  $t0, 0($2)
    lw  $t1, 4($2)
    sw  $t1, 0($2)
    sw  $t0, 4($2)
↓ Assembler
Machine Language Program (MIPS):
    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111
↓ Machine Interpretation
Hardware Architecture Description (e.g., block diagrams)
↓ Architecture Implementation
Logic Circuit Description (circuit schematic diagrams)

Anything can be represented as a number, i.e., data or instructions.
Instruction Level Parallelism
• Instruction execution can be divided into multiple stages (5 stages in RISC):
– Instruction fetch cycle (IF): send the PC to memory, fetch the current instruction from memory, and update the PC to the next sequential PC by adding 4 to it.
– Instruction decode/register fetch cycle (ID): decode the instruction and read the registers corresponding to the register source specifiers from the register file.
– Execution/effective address cycle (EX): perform the memory address calculation for load/store instructions, or the ALU operation for register-register and register-immediate ALU instructions.
– Memory access (MEM): perform the memory access for load/store instructions.
– Write-back cycle (WB): write results back to the destination operands for register-register ALU or load instructions.
Pipelined Instruction Execution
[Figure: four instructions, in instruction order, shown over clock cycles 1–7; each instruction flows through Ifetch, Reg, ALU, DMem, and Reg (write-back), offset by one cycle from the previous instruction.]
Pipelining: It's Natural!
• Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
– Washer takes 30 minutes
– Dryer takes 40 minutes
– “Folder” takes 20 minutes
• One load: 90 minutes
Sequential Laundry
• Sequential laundry takes 6 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: timeline from 6 PM to midnight; loads A–D run back-to-back in task order, each taking 30 + 40 + 20 minutes.]
Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads (sequential laundry takes 6 hours)
[Figure: timeline from 6 PM to midnight; each load's wash overlaps the previous load's dry, so the four loads finish in 30 + 4×40 + 20 = 210 minutes.]
Classic 5-Stage Pipeline for a RISC
• Each cycle the hardware initiates a new instruction and is executing some part of five different instructions.
– One cycle per instruction vs. 5 cycles per instruction

| Instruction \ Clock | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|
| Instruction i | IF | ID | EX | MEM | WB | | | | |
| Instruction i+1 | | IF | ID | EX | MEM | WB | | | |
| Instruction i+2 | | | IF | ID | EX | MEM | WB | | |
| Instruction i+3 | | | | IF | ID | EX | MEM | WB | |
| Instruction i+4 | | | | | IF | ID | EX | MEM | WB |
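The timing in the table can be checked with a small calculation: with no stalls, a k-stage pipeline finishes n instructions in k + (n − 1) cycles, while an unpipelined machine needs n×k. A minimal sketch in C (the function names are illustrative, not from the lecture):

```c
/* Cycles to run n instructions on a k-stage pipeline with no stalls:
 * the first instruction takes k cycles to drain through all stages,
 * then one instruction completes every cycle after that. */
unsigned pipelined_cycles(unsigned n, unsigned k) {
    return n == 0 ? 0 : k + (n - 1);
}

/* Unpipelined: every instruction occupies all k stages serially. */
unsigned sequential_cycles(unsigned n, unsigned k) {
    return n * k;
}
```

For the 5 instructions in the table, pipelined_cycles(5, 5) gives 9 cycles, matching clock columns 1 through 9, while the unpipelined machine would need 25.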
Pipeline and Superscalar
Advanced ILP
• Dynamic Scheduling → Out-of-order Execution
• Speculation → In-order Commit
• Superscalar → Multiple Issue

| Technique | Goal | Implementation | Addressing | Approach |
|---|---|---|---|---|
| Dynamic scheduling | Out-of-order execution | Reservation stations, load/store buffers, and the CDB | Data hazards (RAW, WAW, WAR) | Register renaming |
| Speculation | In-order commit | Branch prediction (BHT/BTB) and the reorder buffer | Control hazards (branch, function, exception) | Prediction and misprediction recovery |
| Superscalar/VLIW | Multiple issue | Software and hardware | To increase IPC (CPI below 1) | By compiler or hardware |
Problems of traditional ILP scaling
• Fundamental circuit limitations [1]
– Delays grow as issue queues and multi-port register files grow
– Increasing delays limit the performance returns from wider issue
• Limited amount of instruction-level parallelism [1]
– Inefficient for codes with difficult-to-predict branches
• Power and heat stall clock frequencies

[1] The case for a single-chip multiprocessor, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, ASPLOS-VII, 1996.
ILP impacts
Simulations of 8-issue Superscalar
Power/heat density limits frequency
• Some fundamental physical limits are being reached
We will have this…
Revolution is happening now
• Chip density is continuing to increase ~2x every 2 years
– Clock speed is not
– The number of processor cores may double instead
• There is little or no hidden parallelism (ILP) left to be found
• Parallelism must be exposed to and managed by software
– No free lunch
Source: Intel, Microsoft (Sutter) and Stanford (Olukotun, Hammond)
Current Trends in Architecture
• Cannot continue to leverage Instruction-Level Parallelism (ILP)
– Single-processor performance improvement ended in 2003
• Recent models for performance:
– Exploit Data-Level Parallelism (DLP) via SIMD architectures and GPUs
– Exploit Thread-Level Parallelism (TLP) via MIMD
– Others
SIMD: Single Instruction, Multiple Data (Data Level Parallelism)
• SIMD architectures can exploit significant data-level parallelism for:
– matrix-oriented scientific computing
– media-oriented image and sound processing
• SIMD is more energy efficient than MIMD
– Only needs to fetch one instruction per data operation processing multiple data elements
– Makes SIMD attractive for personal mobile devices
• SIMD allows the programmer to continue to think sequentially
[Figure: a control unit broadcasts one instruction stream to four processors, each operating on its own data.]
SIMD Parallelism
• Three variations
– Vector architectures (early age)
– SIMD extensions
– Graphics Processing Units (GPUs) (dedicated weeks for GPUs)
• For x86 processors:
– Expect two additional cores per chip per year (MIMD)
– SIMD width to double every four years
– Potential speedup from SIMD to be twice that from MIMD!
Vector Architectures
• Vector processors abstract operations on vectors, e.g., replacing the loop

    for (i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }

with

    a = b + c;    (one vector instruction: ADDV.D V10, V8, V6)

• Some languages offer high-level support for these operations (e.g., Fortran 90 or newer)
Vector Programming Model
[Figure: the register state comprises scalar registers r0–r15, vector registers v0–v15 (each holding elements [0] through [VLRMAX-1]), and a Vector Length Register (VLR).
Vector arithmetic instructions, e.g. ADDV v3, v1, v2, add elements [0] through [VLR-1] of v1 and v2 into v3.
Vector load and store instructions, e.g. LV v1, (r1, r2), move a vector between memory and a vector register, with the base address in r1 and the stride in r2.]
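The VLR is what lets a fixed set of vector instructions handle arrays of arbitrary length: the loop is "strip-mined" into chunks of at most VLRMAX elements, with VLR set to the chunk size on each pass. A minimal C sketch of the idea (the VLRMAX value and function name are illustrative assumptions, not lecture code):

```c
#define VLRMAX 64  /* maximum vector length, e.g. 64 elements */

/* Strip-mined vector add: each outer iteration models setting VLR and
 * issuing one LV/LV/ADDV/SV sequence over up to VLRMAX elements. */
void vec_add(double *a, const double *b, const double *c, int n) {
    for (int i = 0; i < n; i += VLRMAX) {
        int vlr = (n - i < VLRMAX) ? (n - i) : VLRMAX;  /* set VLR */
        for (int j = 0; j < vlr; j++)  /* one ADDV over VLR elements */
            a[i + j] = b[i + j] + c[i + j];
    }
}
```

The inner loop stands in for a single vector instruction; a real vector machine executes it in hardware, pipelined across the vector elements.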
Vector Was Supercomputers
• Epitome: Cray-1, 1976
• Scalar unit
– Load/store architecture
• Vector extension
– Vector registers
– Vector instructions
• Implementation
– Hardwired control
– Highly pipelined functional units
– Interleaved memory system
– No data caches
– No virtual memory
AXPY (64 elements) (Y = a*X + Y) in MIPS and VMIPS
• Number of instructions: 6 vs. ~600
• Pipeline stalls: 64× higher for MIPS
• Vector chaining (forwarding): V1, V2, V3 and V4

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];

The starting addresses of X and Y are in Rx and Ry, respectively.
SIMD Instructions
• Originally developed for multimedia applications
• The same operation is executed on multiple data items
• Uses a fixed-length register and partitions the carry chain so the same functional unit can serve multiple operations
– E.g., a 64-bit adder can be used for two 32-bit add operations simultaneously
SIMD Instructions
• MMX (Multi-Media Extension), 1996
– The existing 64-bit floating-point registers could be used for eight 8-bit operations or four 16-bit operations
• SSE (Streaming SIMD Extension), 1999
– Successor to the MMX instructions
– Separate 128-bit registers added, for sixteen 8-bit, eight 16-bit, or four 32-bit operations
• SSE2 (2001), SSE3 (2004), SSE4 (2007)
– Added support for double-precision operations
• AVX (Advanced Vector Extensions), 2010
– 256-bit registers added
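The register-width progression above is what C SIMD programming targets. A hedged sketch using the GCC/Clang vector_size extension (this is not lecture code, and real SIMD kernels more often use intrinsics such as SSE's _mm_add_ps; the type name here is an illustrative assumption):

```c
/* Four 32-bit floats packed into one 128-bit value,
 * the same shape as an SSE register. */
typedef float f32x4 __attribute__((vector_size(16)));

/* One "instruction" adds all four lanes at once; the compiler can
 * lower this to a single SIMD add where the hardware supports it. */
f32x4 add4(f32x4 x, f32x4 y) {
    return x + y;
}
```

With AVX the same pattern widens to 256 bits (eight floats or four doubles per operation).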
AXPY
• With 256-bit SIMD extensions (4 double-precision FP lanes):
– MIPS: 578 instructions
– SIMD MIPS: 149 instructions (about a 4× reduction)
– VMIPS: 6 instructions (about a 100× reduction)

    for (i = 0; i < 64; i++)
        Y[i] = a * X[i] + Y[i];
State of the Art: Intel Xeon Phi Manycore Vector Capability
• Intel Xeon Phi Knights Corner, 2012, ~60 cores, 4-way SMT
• Intel Xeon Phi Knights Landing, 2016, ~60 cores, 4-way SMT and HBM
– http://www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70-Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf
http://primeurmagazine.com/repository/PrimeurMagazine-AE-PR-12-14-32.pdf
State of the Art: ARM Scalable Vector Extension (SVE)
• Announced in August 2016
– https://community.arm.com/groups/processors/blog/2016/08/22/technology-update-the-scalable-vector-extension-sve-for-the-armv8-a-architecture
– http://www.hotchips.org/wp-content/uploads/hc_archives/hc28/HC28.22-Monday-Epub/HC28.22.10-GPU-HPC-Epub/HC28.22.131-ARMv8-vector-Stephens-Yoshida-ARM-v8-23_51-v11.pdf
• Beyond the vector architecture we learned:
– Vector loops, predication and speculation
– Vector Length Agnostic (VLA) programming
– Check the slides
Limitations of optimizing a single instruction stream
• Problem: within a single instruction stream we do not find enough independent instructions to execute simultaneously, due to
– data dependencies
– limitations of speculative execution across multiple branches
– difficulties in detecting memory dependencies among instructions (alias analysis)
• Consequence: a significant number of functional units are idle at any given time
• Question: can we instead execute instructions from another instruction stream?
– Another thread?
– Another process?
Thread-level parallelism
• Problems with executing instructions from multiple threads at the same time:
– The instructions in each thread might use the same register names
– Each thread has its own program counter
• Virtual memory management allows multiple threads to execute and share the main memory
• When to switch between different threads:
– Fine-grain multithreading: switches between threads on every instruction
– Coarse-grain multithreading: switches only on costly stalls (e.g., level-2 cache misses)
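The two switching policies can be contrasted with a toy scheduling rule: fine-grain rotates to the next thread every cycle, while coarse-grain stays on the current thread until it stalls. A minimal sketch (the function names and the stall flag are illustrative assumptions, not from the lecture):

```c
/* Fine-grain multithreading: issue from a different thread each cycle,
 * rotating round-robin over the hardware thread contexts. */
int fine_grain_next(int current_thread, int num_threads) {
    return (current_thread + 1) % num_threads;
}

/* Coarse-grain multithreading: keep issuing from the current thread,
 * switching only when it stalls on a costly event (e.g., an L2 miss). */
int coarse_grain_next(int current_thread, int num_threads, int stalled) {
    return stalled ? (current_thread + 1) % num_threads : current_thread;
}
```

Fine-grain hides short stalls at the cost of slowing every individual thread; coarse-grain keeps single-thread throughput but leaves short stalls unhidden.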
Convert Thread-level parallelism to instruction-level parallelism
[Figure: issue slots over processor cycles under superscalar, fine-grained, coarse-grained, and simultaneous multithreading execution; shading distinguishes threads 1–5 and idle slots.]
ILP to Do TLP: e.g., Simultaneous Multi-Threading (SMT)
• Works well if
– the number of compute-intensive threads does not exceed the number of threads supported by SMT
– the threads have highly different characteristics (e.g., one thread doing mostly integer operations, another mainly doing floating-point operations)
• Does not work well if
– threads try to utilize the same functional units
• e.g., a dual-processor system, each processor supporting 2 threads simultaneously (the OS thinks there are 4 processors)
• 2 compute-intensive application processes might end up on the same processor instead of different processors (the OS does not see the difference between SMT and real processors!)
Power, Frequency and ILP
CPU frequency increase flattened around 2000–2005, for two main reasons:
1. Limited ILP
2. Power consumption and heat dissipation
Note: even Moore's Law is ending around 2021:
http://spectrum.ieee.org/semiconductors/devices/transistors-could-stop-shrinking-in-2021
https://www.technologyreview.com/s/601441/moores-law-is-dead-now-what/
http://www.forbes.com/sites/timworstall/2016/07/26/economics-is-important-the-end-of-moores-law
History – Past (2000) and Today
Flynn’s Taxonomy
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
[Figure: the taxonomy chart annotated with check marks and a cross indicating which categories are in current use.]
Examples of MIMD Machines
• Symmetric Shared-Memory Multiprocessor (SMP)
– Multiple processors in a box with shared-memory communication
– Current multicore chips are like this
– Every processor runs a copy of the OS
• Distributed/Non-uniform Shared-Memory Multiprocessor
– Multiple processors, each with local memory, on a general scalable network
– An extremely light "OS" on each node provides simple services
• Scheduling/synchronization
– A network-accessible host for I/O
• Cluster
– Many independent machines connected with a general network
– Communication through messages
[Figure: an SMP with processors P on a shared bus to memory; a grid of processor/memory (P/M) nodes on a network, with a host node.]
Symmetric (Shared-Memory) Multiprocessors (SMP)
• Small numbers of cores
– Typically eight or fewer, and no more than 32 in most cases
• Share a single centralized memory that all processors have equal access to
– Hence the term symmetric
• All existing multicores are SMPs.
• Also called uniform memory access (UMA) multiprocessors
– All processors have a uniform latency
Centralized Shared Memory System (I)
• Multi-core processors
– Typically connected over a cache
– Previous SMP systems were typically connected over the main memory
• Intel X7350 quad-core (Tigerton)
– Private L1 cache: 32 KB instruction, 32 KB data
– Shared L2 cache: 4 MB unified cache
[Figure: two pairs of cores, each core with a private L1 and each pair sharing an L2, connected to a 1066 MHz FSB.]
Centralized Shared Memory System (SMP) (II)
• Intel X7350 quad-core (Tigerton) multi-processor configuration
[Figure: four sockets (0–3) holding cores C0–C15, each pair of cores sharing an L2; each socket connects at 8 GB/s to a Memory Controller Hub (MCH), which connects to four memory channels.]
Distributed Shared-Memory Multiprocessor
• Large processor count
– 64 to 1000s
• Distributed memory
– Remote vs. local memory
– Long vs. short latency
– High vs. low latency
• Interconnection network
– Bandwidth, topology, etc.
• Nonuniform memory access (NUMA)
• Each processor may have local I/O
Distributed Shared-Memory Multiprocessor (NUMA)
• Reduces the memory bottleneck compared to SMPs
• More difficult to program efficiently
– E.g., first-touch policy: a data item is placed in the memory of the processor that uses it first
• To reduce the effects of non-uniform memory access, caches are often used
– ccNUMA: cache-coherent non-uniform memory access architectures
• Largest example as of today: SGI Origin with 512 processors
Shared-Memory Multiprocessor
• SMP and DSM are both shared-memory multiprocessors
– UMA or NUMA
• Multicores are SMP shared memory
• Most multi-CPU machines are DSM
– NUMA
• Shared address space (virtual address space)
– Not always shared memory
Current Trends in Computer Architecture
• Cannot continue to leverage ILP
– Single-processor performance improvement ended in 2003
• Current models for performance:
– Exploit Data-Level Parallelism (DLP) via SIMD architectures (vector, SIMD extensions and GPUs)
– Exploit Thread-Level Parallelism (TLP) via MIMD
– Heterogeneity: integrate multiple and different architectures together at the chip/system level
• Emerging architectures
– Domain-specific architectures: deep learning processing units (e.g., TPU)
– E.g., Machine Learning Pulls Processor Architectures onto New Path
• https://www.top500.org/news/machine-learning-pulls-processor-architectures-onto-new-path/
These require explicit restructuring of the application ← Parallel Programming
The “Future” of Moore’s Law
• The chips are down for Moore’s law
– http://www.nature.com/news/the-chips-are-down-for-moore-s-law-1.19338
• Special Report: 50 Years of Moore’s Law
– http://spectrum.ieee.org/static/special-report-50-years-of-moores-law
• Moore’s law really is dead this time
– http://arstechnica.com/information-technology/2016/02/moores-law-really-is-dead-this-time/
• Rebooting the IT Revolution: A Call to Action (SIA/SRC, 2015)
– https://www.semiconductors.org/clientuploads/Resources/RITR%20WEB%20version%20FINAL.pdf