Lecture 11: Distributed Memory Machines and Programming
TRANSCRIPT
Lecture 11: Distributed Memory Machines and Programming
1
CSCE 569 Parallel Computing
Department of Computer Science and Engineering
Yonghong Yan
[email protected]
https://cse.sc.edu/~yanyh
Topics
• Introduction
• Programming on shared memory system (Chapter 7)
  – OpenMP
• Principles of parallel algorithm design (Chapter 3)
• Programming on large scale systems (Chapter 6)
  – MPI (point to point and collectives)
  – Introduction to PGAS languages, UPC and Chapel
• Analysis of parallel program executions (Chapter 5)
  – Performance Metrics for Parallel Systems
    • Execution Time, Overhead, Speedup, Efficiency, Cost
  – Scalability of Parallel Systems
  – Use of performance tools
2
Acknowledgement
• Slides adapted from U.C. Berkeley course CS267/EngC233 Applications of Parallel Computers by Jim Demmel and Katherine Yelick, Spring 2011
  – http://www.cs.berkeley.edu/~demmel/cs267_Spr11/
• And materials from various sources
3
Shared Memory Parallel Systems: Multicore and Multi-CPU
4
Node-level Architecture and Programming

• Shared memory multiprocessors: multicore, SMP, NUMA
  – Deep memory hierarchy; distant memory is much more expensive to access.
  – Machines scale to 10s or 100s of processors
  – Instruction Level Parallelism (ILP), Data Level Parallelism (DLP) and Thread Level Parallelism (TLP)
• Programming
  – OpenMP, PThreads, Cilkplus, etc.
5
HPC Architectures (TOP500, Nov 2014)
6
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
7
Clusters
• A group of linked computers working together so closely that in many respects they form a single computer.
• Consists of
  – Nodes (front node + computing nodes)
  – Network
  – Software: OS and middleware
8
[Diagram: Node … Node … Node connected by a High Speed Local Network, with Cluster Middleware layered across the nodes]
9
10
Top 10 of Top500
11
http://www.top500.org/lists/2016/06/
(Ethernet, Infiniband…) + (MPI)

HPC Beowulf Cluster

• Master node: or service/front node (used to interact with users locally or remotely)
• Computing nodes: perform computations
• Interconnect and switch between nodes: e.g. G/10G-bit Ethernet, Infiniband
• Inter-node programming
  – MPI (Message Passing Interface) is the most commonly used one.
12
Network Switch
13
Network Interface Card (NIC)
14
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
15
Network Analogy

• To have a large number of different transfers occurring at once, you need a large number of distinct wires
  – Not just a bus, as in shared memory
• Networks are like streets:
  – Link = street.
  – Switch = intersection.
  – Distances (hops) = number of blocks traveled.
  – Routing algorithm = travel plan.
• Properties:
  – Latency: how long to get between nodes in the network.
  – Bandwidth: how much data can be moved per unit time.
    • Bandwidth is limited by the number of wires and the rate at which each wire can accept data.
16
Latency and Bandwidth

• Latency: time to travel from one location to another for a vehicle
  – Vehicle type (large or small messages)
  – Road/traffic condition, speed limit, etc.
• Bandwidth: how many cars can travel, and how fast, from one location to another
  – Number of lanes
17
Performance Properties of a Network: Latency

• Diameter: the maximum (over all pairs of nodes) of the shortest path between a given pair of nodes.
• Latency: delay between send and receive times
  – Latency tends to vary widely across architectures
  – Vendors often report hardware latencies (wire time)
  – Application programmers care about software latencies (user program to user program)
• Observations:
  – Latencies differ by 1-2 orders across network designs
  – Software/hardware overhead at source/destination dominates cost (1s-10s usecs)
  – Hardware latency varies with distance (10s-100s nsec per hop) but is small compared to overheads
• Latency is key for programs with many small messages

18

1 second = 10^3 milliseconds (ms) = 10^6 microseconds (us) = 10^9 nanoseconds (ns)
Latency on Some Machines/Networks

[Bar chart: 8-byte roundtrip latency (usec), MPI ping-pong: Elan3/Alpha 14.6, Elan4/IA64 6.6, Myrinet/x86 22.1, IB/G5 9.6, IB/Opteron 18.5, SP/Fed 24.2]

19

• Latencies shown are from a ping-pong test using MPI
• These are roundtrip numbers: many people use 1/2 of roundtrip time to approximate 1-way latency (which can't easily be measured)
End to End Latency (1/2 roundtrip) Over Time

[Scatter plot: one-way latency (usec, log scale 1-100) vs. year (approximate, 1990-2010) for nCube/2, CM5, CS2, SP1, SP2, Paragon, T3D, SPP, KSR, Cenju3, T3E, SP-Power3, Quadrics, Myrinet; values range from about 2.6 to 36.34 usec]

20

• Latency has not improved significantly, unlike Moore's Law
• T3E (shmem) was lowest point, in 1997

Data from Kathy Yelick, UCB and NERSC
Performance Properties of a Network: Bandwidth

• The bandwidth of a link = #wires / time-per-bit
• Bandwidth typically in Gigabytes/sec (GB/s), i.e., 8*2^30 bits per second
• Effective bandwidth is usually lower than physical link bandwidth due to packet overhead.

21

[Packet format: routing and control header | data payload | error code | trailer]

• Bandwidth is important for applications with mostly large messages
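The packet-overhead point above can be made concrete with a small sketch. The 64-byte overhead below is an illustrative assumption, not a figure for any real network:

```python
# Effective bandwidth after packet overhead: only the payload fraction of
# each packet carries user data, so effective bw < physical link bw.
# Header/trailer size below is illustrative, not from a real network spec.

def effective_bandwidth(link_bw_gbs, payload_bytes, overhead_bytes):
    """Physical link bandwidth scaled by the payload fraction of a packet."""
    total = payload_bytes + overhead_bytes
    return link_bw_gbs * payload_bytes / total

# 1 GB/s link, 4 KB payload, 64 bytes of header + error code + trailer
print(round(effective_bandwidth(1.0, 4096, 64), 3))   # 0.985
```

Larger payloads amortize the fixed per-packet overhead, which is one reason bandwidth-bound applications prefer large messages.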
Bandwidth on Some Networks

[Bar chart: flood bandwidth for 2MB messages (MPI), as percent of hardware peak (0-100%), for Elan3/Alpha, Elan4/IA64, Myrinet/x86, IB/G5, IB/Opteron, SP/Fed; data labels in MB/s include 1504, 630, 244, 857, 225, 610]

22

• Flood bandwidth (throughput of back-to-back 2MB messages)
Bandwidth Chart

23

[Line chart: bandwidth (MB/sec, 0-400) vs. message size (2048-131072 bytes) for T3E/MPI, T3E/Shmem, IBM/MPI, IBM/LAPI, Compaq/Put, Compaq/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL, SysKonnect]

Data from Mike Welcome, NERSC

Note: bandwidth depends on SW, not just HW
Performance Properties of a Network: Bisection Bandwidth

• Bisection bandwidth: bandwidth across smallest cut that divides network into two equal halves
• Bandwidth across "narrowest" part of the network

24

[Diagram: a bisection cut vs. not a bisection cut; bisection bw = link bw for a linear array, bisection bw = sqrt(n)*link bw for a 2D mesh]

• Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
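The sqrt(n)*link bw figure in the diagram can be checked by directly counting links across the middle cut of a square mesh. This is a small sketch under the usual assumption that the cut runs between the two middle columns:

```python
# Bisection bandwidth of a sqrt(n) x sqrt(n) 2-D mesh, counted directly:
# cut the mesh between its two middle columns and count crossing links.

def mesh_bisection_links(side):
    """Horizontal links crossing the vertical middle cut of a side x side mesh."""
    count = 0
    for r in range(side):
        for c in range(side - 1):
            # link (r, c)-(r, c+1) crosses the cut between columns side//2-1 and side//2
            if c == side // 2 - 1:
                count += 1
    return count

def torus_bisection_links(side):
    """A torus adds one wrap-around link per row, doubling the cut."""
    return 2 * mesh_bisection_links(side)

n = 16                                # 4 x 4 network
side = int(n ** 0.5)
print(mesh_bisection_links(side))     # 4, i.e. sqrt(n) * link bw
print(torus_bisection_links(side))    # 8, i.e. 2 * sqrt(n) * link bw
```

One link per row crosses the cut, so the bisection grows only as sqrt(n) while the node count grows as n, which is exactly why all-to-all algorithms stress this metric.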
Other Characteristics of a Network

• Topology (how things are connected)
  – Crossbar, ring, 2-D and 3-D mesh or torus, hypercube, tree, butterfly, perfect shuffle...
• Routing algorithm:
  – Example in 2D torus: all east-west then all north-south (avoids deadlock).
• Switching strategy:
  – Circuit switching: full path reserved for entire message, like the telephone.
  – Packet switching: message broken into separately-routed packets, like the post office.
• Flow control (what if there is congestion):
  – Stall, store data temporarily in buffers, re-route data to other nodes, tell source node to temporarily halt, discard, etc.
25
Network Topology

• In the past, there was considerable research in network topology and in mapping algorithms to topology.
  – Key cost to be minimized: number of "hops" between nodes (e.g. "store and forward")
  – Modern networks hide hop cost (i.e., "wormhole routing"), so topology is no longer a major factor in algorithm performance.
• Example: on the IBM SP system, hardware latency varies from 0.5 usec to 1.5 usec, but user-level message passing latency is roughly 36 usec.
• Need some background in network topology
  – Algorithms may have a communication topology
  – Topology affects bisection bandwidth.
26
Linear and Ring Topologies

• Linear array
  – Diameter = n-1; average distance ~ n/3.
  – Bisection bandwidth = 1 (in units of link bandwidth).
• Torus or Ring
  – Diameter = n/2; average distance ~ n/4.
  – Bisection bandwidth = 2.
  – Natural for algorithms that work with 1D arrays.
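The diameter and average-distance figures above can be verified by brute force over all node pairs; a small sketch:

```python
# Diameter and average distance of a linear array vs. a ring, checked by
# enumerating the shortest hop count between every pair of nodes.

def distances(n, ring):
    """Shortest-path hop counts for all unordered node pairs."""
    d = []
    for i in range(n):
        for j in range(i + 1, n):
            hops = abs(i - j)
            if ring:
                hops = min(hops, n - hops)   # may go the short way around
            d.append(hops)
    return d

n = 64
line = distances(n, ring=False)
ring = distances(n, ring=True)
print(max(line), max(ring))      # diameters: n-1 = 63 and n/2 = 32
print(sum(line) / len(line))     # average distance, close to n/3
print(sum(ring) / len(ring))     # average distance, close to n/4
```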
27
Meshes and Tori

• Two dimensional mesh
  – Diameter = 2*(sqrt(n) - 1)
  – Bisection bandwidth = sqrt(n)
• Two dimensional torus
  – Diameter = sqrt(n)
  – Bisection bandwidth = 2*sqrt(n)

28

• Generalizes to higher dimensions
• Cray XT (e.g. Franklin@NERSC) uses 3D Torus
• Natural for algorithms that work with 2D and/or 3D arrays (matmul)
Hypercubes

• Number of nodes n = 2^d for dimension d.
  – Diameter = d.
  – Bisection bandwidth = n/2.
• 0d, 1d, 2d, 3d, 4d
• Popular in early machines (Intel iPSC, NCUBE).
  – Lots of clever algorithms.
• Gray code addressing:
  – Each node connected to others with 1 bit different.

29

[Diagram: 3-cube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111]
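Gray code addressing can be sketched with the standard reflected-Gray-code formula; consecutive codes differ in exactly one bit, so they label neighboring hypercube nodes:

```python
# Gray-code addressing on a d-dimensional hypercube: the i-th code is
# i ^ (i >> 1), and consecutive codes differ in exactly one bit.

def gray(d):
    """All 2**d d-bit Gray codes, in sequence."""
    return [i ^ (i >> 1) for i in range(2 ** d)]

codes = gray(3)
print([format(c, "03b") for c in codes])
# ['000', '001', '011', '010', '110', '111', '101', '100']

# Each step flips a single bit, i.e. moves along one hypercube edge,
# so the sequence is a Hamiltonian path on the 3-cube.
for a, b in zip(codes, codes[1:]):
    assert bin(a ^ b).count("1") == 1
```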
Trees
• Diameter = log n.
• Bisection bandwidth = 1.
• Easy layout as planar graph.
• Many tree algorithms (e.g., summation).
• Fat trees avoid bisection bandwidth problem:
  – More (or wider) links near top.
  – Example: Thinking Machines CM-5.
30
Butterflies

• Diameter = log n.
• Bisection bandwidth = n.
• Cost: lots of wires.
• Used in BBN Butterfly.
• Natural for FFT.

31

[Diagram: a 2x2 butterfly switch (ports 0 and 1) and a multistage butterfly network]

Ex: to get from proc 101 to 110, compare bit-by-bit and switch if they disagree, else not
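The bit-by-bit routing rule in the example above can be sketched directly; the function name and "straight"/"cross" labels are just illustrative:

```python
# Butterfly routing: at each stage, compare one bit of the source and
# destination addresses; the switch crosses when the bits disagree.

def butterfly_route(src, dst, bits):
    """Per-stage switch settings for routing src -> dst (MSB first)."""
    settings = []
    for stage in range(bits - 1, -1, -1):
        s = (src >> stage) & 1
        d = (dst >> stage) & 1
        settings.append("cross" if s != d else "straight")
    return settings

# the slide's example: proc 101 -> proc 110 (binary)
print(butterfly_route(0b101, 0b110, 3))   # ['straight', 'cross', 'cross']
```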
Topologies in Real Machines

32

newer
  Cray XT3 and XT4                          3D Torus (approx)
  Blue Gene/L                               3D Torus
  SGI Altix                                 Fat tree
  Cray X1                                   4D Hypercube*
  Myricom (Millennium)                      Arbitrary
  Quadrics (in HP Alpha server clusters)    Fat tree
  IBM SP                                    Fat tree (approx)
  SGI Origin                                Hypercube
  Intel Paragon (old)                       2D Mesh
  BBN Butterfly (really old)                Butterfly
older

Many of these are approximations: e.g., the X1 is really a "quad bristled hypercube" and some of the fat trees are not as fat as they should be at the top
Performance Models
33
Latency and Bandwidth Model

• Time to send a message of length n is roughly

    Time = latency + n*cost_per_word = latency + n/bandwidth

• Topology is assumed irrelevant.
• Often called the "alpha-beta model" and written

    Time = a + n*b

• Usually a >> b >> time per flop.
  – One long message is cheaper than many short ones:

    a + n*b << n*(a + 1*b)

  – Can do hundreds or thousands of flops for cost of one message.
• Lesson: need large computation-to-communication ratio to be efficient.

34
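The model is easy to exercise numerically. The sketch below uses the T3E/MPI parameters from the table on the next slide (a = 6.7 usec, b = 0.003 usec/byte) to show why one long message beats many short ones:

```python
# Alpha-beta model: Time = a + n*b, with a = per-message latency (usec)
# and b = per-byte cost (usec/byte).

def message_time(n_bytes, alpha, beta):
    return alpha + n_bytes * beta

alpha, beta = 6.7, 0.003              # T3E/MPI from the parameter table

one_long = message_time(10_000, alpha, beta)        # one 10 KB message
many_short = 100 * message_time(100, alpha, beta)   # 100 x 100-byte messages
print(one_long, many_short)   # 36.7 vs 700.0 usec: latency dominates
```

The same 10 KB of data costs ~19x more when split into 100 messages, because the fixed alpha term is paid 100 times.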
Alpha-Beta Parameters on Current Machines

• These numbers were obtained empirically

35

machine          a        b
T3E/Shm          1.2      0.003
T3E/MPI          6.7      0.003
IBM/LAPI         9.4      0.003
IBM/MPI          7.6      0.004
Quadrics/Get     3.267    0.00498
Quadrics/Shm     1.3      0.005
Quadrics/MPI     7.3      0.005
Myrinet/GM       7.7      0.005
Myrinet/MPI      7.2      0.006
Dolphin/MPI      7.767    0.00529
Giganet/VIPL     3.0      0.010
GigE/VIPL        4.6      0.008
GigE/MPI         5.854    0.00872

a is latency in usecs
b is BW in usecs per Byte

How well does the model Time = a + n*b predict actual performance?
Model Time Varying Message Size & Machines
36
Measured Message Time
37
LogP Model

• 4 performance parameters
  – L: latency experienced in each communication event
    • time to communicate word or small # of words
  – o: send/recv overhead experienced by processor
    • time processor fully engaged in transmission or reception
  – g: gap between successive sends or recvs by a processor
    • 1/g = communication bandwidth
  – P: number of processor/memory modules

38
LogP Parameters: Overhead & Latency

• Non-overlapping overhead:

    EEL = End-to-End Latency = osend + L + orecv

• Send and recv overhead can overlap:

    EEL = f(osend, L, orecv) >= max(osend, L, orecv)

39

[Timing diagrams: P0 pays osend, the message spends L on the wire, P1 pays orecv; in the second diagram osend and orecv overlap in time]
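The two EEL formulas can be sketched as functions; the numeric values below are illustrative, not measurements:

```python
# End-to-end latency in the LogP model: with no overlap the three terms
# simply add; with full overlap EEL can only shrink toward the largest term.

def eel_no_overlap(o_send, L, o_recv):
    return o_send + L + o_recv

def eel_lower_bound(o_send, L, o_recv):
    return max(o_send, L, o_recv)

o_send, L, o_recv = 2.0, 5.0, 2.5     # illustrative usec values
print(eel_no_overlap(o_send, L, o_recv))    # 9.5
print(eel_lower_bound(o_send, L, o_recv))   # 5.0
```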
LogP Parameters: gap

• The gap is the delay between sending messages
• Gap could be greater than send overhead
  – NIC may be busy finishing the processing of the last message and cannot accept a new one.
  – Flow control or backpressure on the network may prevent the NIC from accepting the next message to send.
• No overlap => time to send n messages (pipelined) =

    (osend + L + orecv - gap) + n*gap = alpha + n*beta

40

[Timing diagram: P0 issues successive sends separated by gap, each paying osend]
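The pipelined formula above has the same alpha + n*beta shape as the latency/bandwidth model, which a short sketch makes explicit (parameter values are illustrative):

```python
# Pipelined sends in the LogP model: time for n messages with no overlap is
# (osend + L + orecv - gap) + n*gap, i.e. a one-time pipeline-fill cost
# plus one gap per message in steady state.

def pipelined_time(n, o_send, L, o_recv, gap):
    alpha = o_send + L + o_recv - gap    # pipeline fill (paid once)
    beta = gap                           # steady-state cost per message
    return alpha + n * beta

# illustrative usec values; gap exceeds osend (busy NIC / flow control)
print(pipelined_time(1, 2.0, 5.0, 2.5, 3.0))    # 9.5, one end-to-end latency
print(pipelined_time(100, 2.0, 5.0, 2.5, 3.0))  # 306.5
```

For n = 1 the formula collapses to osend + L + orecv, the EEL of a single message; for large n the gap term dominates, matching 1/g = bandwidth.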
Results: EEL and Overhead

41

[Bar chart: usec (0-25) for T3E/MPI, T3E/Shmem, T3E/E-Reg, IBM/MPI, IBM/LAPI, Quadrics/MPI, Quadrics/Put, Quadrics/Get, M2K/MPI, M2K/GM, Dolphin/MPI, Giganet/VIPL; bars break down into Send Overhead (alone), Send & Rec Overhead, Rec Overhead (alone), Added Latency]

Data from Mike Welcome, NERSC
Send Overhead Over Time

• Overhead has not improved significantly; T3D was best
  – Lack of integration; lack of attention in software

42

[Scatter plot: send overhead (usec, 0-14) vs. year (approximate, 1990-2002) for Myrinet2K, Dolphin, T3E, Cenju4, CM5, Meiko, Paragon, T3D, Myrinet, SP3, SCI, Compaq, NCube/2]

Data from Kathy Yelick, UCB and NERSC
Limitations of the LogP Model

• The LogP model has a fixed cost for each message
  – This is useful in showing how to quickly broadcast a single word
  – Other examples also in the LogP papers
• For larger messages, there is a variation LogGP
  – Two gap parameters, one for small and one for large messages
  – The large-message gap is the b in our previous model
• No topology considerations (including no limits for bisection bandwidth)
  – Assumes a fully connected network
  – OK for some algorithms with nearest neighbor communication, but with "all-to-all" communication we need to refine this further
• This is a flat model, i.e., each processor is connected to the network
  – Clusters of multicores are not accurately modeled

43
Summary
• Latency and bandwidth are two important network metrics
  – Latency matters more for small messages than bandwidth
  – Bandwidth matters more for large messages than latency
  – Time = a + n*b
• Communication has overhead from both sending and receiving end
  – EEL = End-to-End Latency = osend + L + orecv
• Multiple communications can overlap
44
Historical Perspective

• Early distributed memory machines were:
  – Collections of microprocessors.
  – Communication was performed using bi-directional queues between nearest neighbors.
    • Messages were forwarded by processors on path.
  – "Store and forward" networking
• There was a strong emphasis on topology in algorithms, in order to minimize the number of hops = minimize time
45
Evolution of Distributed Memory Machines

• Special queue connections are being replaced by direct memory access (DMA):
  – Processor packs or copies messages.
  – Initiates transfer, goes on computing.
• Wormhole routing in hardware:
  – Special message processors do not interrupt main processors along path.
  – Long message sends are pipelined.
  – Processors don't wait for complete message before forwarding
• Message passing libraries provide store-and-forward abstraction:
  – Can send/receive between any pair of nodes, not just along one wire.
  – Time depends on distance since each processor along path must participate.
46
Outline
• Cluster Introduction
• Distributed Memory Architectures
  – Properties of communication networks
  – Topologies
  – Performance models
• Programming Distributed Memory Machines using Message Passing
  – Overview of MPI
  – Basic send/receive use
  – Non-blocking communication
  – Collectives
47