![Page 1: Lecture 21 - Stanford Universityweb.stanford.edu/class/cme213/files/lectures/Lecture_21.pdf · 2017. 6. 9. · 21 ISO-EFFICIENCYPLOTS We plot for the two algorithms the value of p](https://reader034.vdocuments.mx/reader034/viewer/2022051903/5ff4bce676176c06d400a4cc/html5/thumbnails/1.jpg)
CME213
Eric Darve
SPRING 2017
MAIN DEBUGGING TECHNIQUES
A whole class would be needed on this topic!
1. Use assert: test conditions on variables: equality, inequality, magnitude
2. Print out, but asserts are fundamentally better
3. Talk to your duck
4. Use theoretical results, e.g., convergence rate
5. Test against reference code
6. Manufactured solution: compare against known solutions
7. Incremental changes: test after each change (regression testing)
8. Unit/module tests
9. Test inputs of increasing difficulty; code coverage
10. Simplify problem to minimal buggy example
11. Dichotomy, divide-and-conquer
12. Ask piazza, and your TAs/instructor.
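As one possible illustration of items 1 and 4, the sketch below uses `assert` to check that a composite trapezoidal rule converges at its theoretical order of 2. The integrand, the tolerance, and the function names `trapezoid` and `observed_order` are assumptions chosen for this example, not part of the course code.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal sub-intervals."""
    assert n >= 1, "need at least one sub-interval"
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return h * total

def observed_order(f, a, b, exact, n):
    """Estimate the convergence order from the errors with n and 2n intervals."""
    err_coarse = abs(trapezoid(f, a, b, n) - exact)
    err_fine = abs(trapezoid(f, a, b, 2 * n) - exact)
    return math.log2(err_coarse / err_fine)

# Known solution: the integral of sin over [0, pi] is exactly 2.
order = observed_order(math.sin, 0.0, math.pi, 2.0, 64)
# Theory predicts second-order convergence; a bug in the rule usually breaks this.
assert abs(order - 2.0) < 0.05, f"unexpected convergence order {order}"
```

The same pattern (known solution, halve the step, compare the error ratio with theory) carries over to most discretization codes.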
PERFORMANCE METRICS
WHY PERFORMANCE METRICS?
Understanding the performance of a code is important:
● to develop efficient code
● to understand the bottlenecks of a code
● to compare algorithms in a meaningful way, e.g., matrix-vector products using different partitioning schemes for the matrix.
The total run time can be broken down, generally speaking, into the following categories:
● Local computations
● Data exchange and communication
● Idle time (load imbalance, spurious synchronization)
THE BASIC CONCEPT: SPEED-UP
● This quantity measures how much faster the code runs because we are using many processes.
● Define: $T_1(n)$, the optimal (reference) running time with a single process.
● Define: $T_p(n)$, the running time with $p$ processes.
● The speed-up is then the ratio: $S_p(n) = T_1(n) / T_p(n)$
● We expect this number to grow like $p$ as we keep increasing the number of processes.
AMDAHL'S LAW
● Although this is a crude model, it can provide a general sense of what speed-up can be achieved. This is a good measure to understand how much of the code one should try to parallelize.
● Assume that a fraction $f$ of the code is executed sequentially. Then the speed-up is given by: $S_p = \frac{1}{f + (1 - f)/p}$
● This means that the speed-up has an upper bound, $S_p \le 1/f$. The efficiency must go to 0 eventually. In most cases, this statement is true as $p$ increases.
● However, we are saved by another result: $f$ often decreases with $n$, that is, the fraction of parallel work increases with $n$.
● This is why we can still benefit from Peta and Exascale machines with 100,000+ cores.
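To make the bound concrete, here is a minimal sketch that evaluates Amdahl's law in its usual form, $S_p = 1/(f + (1-f)/p)$, for a sequential fraction $f$. The function names and the sample value $f = 0.01$ are assumptions made for illustration.

```python
def amdahl_speedup(f, p):
    """Speed-up predicted by Amdahl's law for sequential fraction f on p processes."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (f + (1.0 - f) / p)

def amdahl_efficiency(f, p):
    """Efficiency = speed-up divided by the number of processes."""
    return amdahl_speedup(f, p) / p

# With f = 0.01 the speed-up saturates below 1/f = 100 while the efficiency decays.
for p in (1, 10, 100, 1000, 100000):
    print(p, round(amdahl_speedup(0.01, p), 2), round(amdahl_efficiency(0.01, p), 4))
```

Repeating the loop with a smaller `f` shows why a sequential fraction that shrinks with problem size keeps very large machines useful.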
PROFILING AND CUDA COMPUTING
● A similar situation arises with CUDA.
1. Profile your code. Kernel 1: 90%
2. Port kernel 1 to GPU. Kernel 1: 10%. Kernel 2 becomes 80% now. Further optimizing kernel 1 will no longer yield any benefit.
● Remember to iterate between profiling and optimizing your code.
● Always focus on the longest kernel as given by your profiler.
A MORE APPROPRIATE CONCEPT: EFFICIENCY
● The previous definition has a problem. As $p$ increases, we want to know whether the speed-up scales as $p$ or not.
● This might be difficult to assess from a plot. Ideally, the speed-up is a straight line.
● It is therefore more convenient to look at the efficiency, the speed-up $S_p$ divided by the number of processes: $E_p = S_p / p$
● Ideally that quantity is simply a constant as $p$ increases. That is easier to read from a plot.
● The maximum value for efficiency is 1 (except in some rare circumstances because of cache effects).
EFFICIENCY PLOTS
Typical behavior of the efficiency as the number of processes increases or as the problem size (amount of computational work to perform) increases.
EXAMPLE 1: DOT PRODUCT
Two-step algorithm:
1. Calculate local dot product: each process sums the products of its own $n/p$ pairs of entries.
2. Use a spanning tree for the final reduction: $\log_2 p$ passes are required.
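The two steps can be sketched as a serial simulation. This is only an illustration under stated assumptions: the "processes" are list slots, $p$ is taken to be a power of two, and `parallel_dot` is a hypothetical name.

```python
def parallel_dot(x, y, p):
    """Simulate the two-step dot product on p processes (p a power of two).

    Returns the result and the number of reduction passes.
    """
    n = len(x)
    assert len(y) == n and p >= 1 and p & (p - 1) == 0
    # Step 1: each simulated process handles a contiguous chunk of about n/p entries.
    bounds = [k * n // p for k in range(p + 1)]
    partial = [
        sum(x[i] * y[i] for i in range(bounds[k], bounds[k + 1]))
        for k in range(p)
    ]
    # Step 2: tree reduction; the number of active processes halves at each pass.
    passes = 0
    stride = 1
    while stride < p:
        for k in range(0, p, 2 * stride):
            partial[k] += partial[k + stride]
        stride *= 2
        passes += 1
    return partial[0], passes

result, passes = parallel_dot(list(range(16)), [1] * 16, 8)
print(result, passes)
```

With 8 simulated processes the reduction takes 3 passes, matching $\log_2 p$.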
EXAMPLE 1: DOT PRODUCT
Total run time with one process: $T_1(n) = \Theta(n)$
Total run time in parallel: $T_p(n) = \Theta(n/p + \log_2 p)$
EFFICIENCY
• Efficiency: $E_p(n) = \frac{T_1(n)}{p\,T_p(n)} = \Theta\!\left(\frac{n}{n + p \log_2 p}\right)$
• The efficiency can be maintained provided that we do not scale $p$ faster than: $p \log_2 p = O(n)$
• This means that $p$ cannot increase as fast as $n$.
• Iso-efficiency means that $p$ increases at a rate such that the efficiency remains constant.
• This is very important. Good algorithms are such that $p$ can be increased rapidly at iso-efficiency.
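A small numerical sketch can show what iso-efficiency means for the dot product. It assumes the simplest cost model consistent with the algorithm, $T_1 = n$ and $T_p = n/p + \log_2 p$, with all machine constants set to 1. Those constants and the names `dot_efficiency` and `max_procs` are assumptions made for illustration.

```python
import math

def dot_efficiency(n, p):
    """Efficiency under the assumed model T_1 = n, T_p = n/p + log2(p)."""
    t1 = float(n)
    tp = n / p + math.log2(p)
    return t1 / (p * tp)

def max_procs(n, target):
    """Largest p whose efficiency is still >= target (efficiency decreases with p)."""
    p = 1
    while dot_efficiency(n, p + 1) >= target:
        p += 1
    return p

# At fixed efficiency, p grows with n but more slowly than n itself.
for n in (1024, 4096, 16384):
    p = max_procs(n, 0.8)
    print(n, p, round(p / n, 4))
```

The ratio `p / n` in the last column shrinks as `n` grows, which is the statement that $p$ cannot increase as fast as $n$.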
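EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 1D PARTITIONING
• In that case, we can model the serial running time as: $T_1(n) = n^2$ (taking one multiply-add as the unit of time)
• Parallel running time (comp + comm): $T_p(n) = \frac{n^2}{p} + t_s \log p + t_w n$
• Efficiency: $E_p(n) = \frac{1}{1 + p\,(t_s \log p + t_w n)/n^2}$
• Iso-efficiency requires: $p = O(n)$
• This is actually not very good. We expect that $p$ should increase like $n^2$ roughly, because this is the amount of work to do.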
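EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 2D PARTITIONING
• Recall that we previously said that 2D partitioning is better. Let’s see if theory supports our claim.
• Computation: $n^2/p$
• Send $b$ to diagonal processes: $t_s + t_w\, n/\sqrt{p}$
• Broadcast $b$ in each column: $(t_s + t_w\, n/\sqrt{p}) \log \sqrt{p}$
• Reduction across columns (same running time as broadcast because these operations are dual of one another): $(t_s + t_w\, n/\sqrt{p}) \log \sqrt{p}$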
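The data movement described above can be followed in a serial sketch that simulates a $q \times q$ process grid with $p = q^2$. It assumes that $q$ divides $n$ and that piece $j$ of $b$ ends up on diagonal process $(j, j)$ before the column broadcast. The name `matvec_2d` is hypothetical, and no real communication takes place.

```python
def matvec_2d(A, b, q):
    """Simulate y = A b on a q x q grid of processes (q must divide n)."""
    n = len(A)
    assert n % q == 0, "q must divide the matrix dimension"
    bs = n // q  # block size
    # Piece j of b is first sent to the diagonal process (j, j).
    diag = [b[j * bs:(j + 1) * bs] for j in range(q)]
    # Broadcast in each column: every process (i, j) receives piece j.
    local_b = [[diag[j] for j in range(q)] for i in range(q)]
    # Computation: process (i, j) multiplies its block A_ij by its piece of b.
    partial = [[[sum(A[i * bs + r][j * bs + c] * local_b[i][j][c]
                     for c in range(bs))
                 for r in range(bs)]
                for j in range(q)]
               for i in range(q)]
    # Reduction across columns: the q partial vectors of process row i are summed.
    y = []
    for i in range(q):
        for r in range(bs):
            y.append(sum(partial[i][j][r] for j in range(q)))
    return y

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
b = [1, 0, 2, 1]
print(matvec_2d(A, b, 2))
```

Each simulated process only ever touches an $(n/\sqrt{p}) \times (n/\sqrt{p})$ block of A and a piece of $b$ of length $n/\sqrt{p}$, which is where the $n/\sqrt{p}$ message sizes in the cost terms come from.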
ISO-EFFICIENCY
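• With the previous results: $T_p(n) \approx \frac{n^2}{p} + t_s \log p + t_w \frac{n}{\sqrt{p}} \log p$
• Iso-efficiency: $p = O\!\left(n^2 / (\ln n)^2\right)$
• This is much better than with the 1D partitioning.
• We can increase $p$ more rapidly at iso-efficiency.
• Another interpretation is that for a given number of processes, this scheme is faster. Practically, it has less communication.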
ISO-EFFICIENCY PLOTS
● We plot for the two algorithms the value of $p$ as a function of $n$ such that iso-efficiency is maintained.
● Larger values of $p$ are better.
● This means improved scalability.
Larger $p$ at iso-efficiency = algorithm is more scalable
Red = higher efficiency than blue = runs faster
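The curves can be approximated numerically. The sketch below assumes the cost models $T_p = n^2/p + t_s \log_2 p + t_w n$ for 1D partitioning and $T_p = n^2/p + t_s \log_2 p + t_w (n/\sqrt{p}) \log_2 p$ for 2D partitioning, with $T_1 = n^2$. The values of `T_S` and `T_W`, the target efficiency, and the function names are arbitrary assumptions for illustration.

```python
import math

T_S = 10.0  # assumed latency, in units of one multiply-add
T_W = 1.0   # assumed per-word transfer time, same units

def overhead_1d(n, p):
    """Communication time of the assumed 1D model."""
    return T_S * math.log2(p) + T_W * n

def overhead_2d(n, p):
    """Communication time of the assumed 2D model."""
    return T_S * math.log2(p) + T_W * (n / math.sqrt(p)) * math.log2(p)

def efficiency(n, p, overhead):
    """E = T_1 / (p T_p) with T_1 = n^2 and T_p = n^2/p + overhead(n, p)."""
    work = float(n) ** 2
    return work / (work + p * overhead(n, p))

def max_procs(n, target, overhead):
    """Largest p with efficiency >= target (efficiency decreases with p)."""
    lo, hi = 1, 2
    while efficiency(n, hi, overhead) >= target:
        lo, hi = hi, 2 * hi
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if efficiency(n, mid, overhead) >= target:
            lo = mid
        else:
            hi = mid
    return lo

for n in (256, 1024, 4096):
    print(n, max_procs(n, 0.5, overhead_1d), max_procs(n, 0.5, overhead_2d))
```

Under these assumptions the 2D column stays above the 1D column, and the gap widens as $n$ grows.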
SUMMARY OF COMMUNICATION TIMES
| Operation | Hypercube time |
|---|---|
| One-to-all broadcast, All-to-one reduction | $\min((t_s + t_w m) \log p,\ 2(t_s \log p + t_w m))$ |
| All-to-all broadcast, All-to-all reduction | $t_s \log p + t_w m (p-1)$ |
| All-reduce | $\min((t_s + t_w m) \log p,\ 2(t_s \log p + t_w m))$ |
| Scatter, Gather | $t_s \log p + t_w m (p-1)$ |
| All-to-all personalized | $(t_s + t_w m)(p-1)$ |
| Circular shift | $t_s + t_w m$ |

● $m$: size of message
● $p$: number of processes
● $t_s$: latency
● $t_w$: per-word transfer time (inverse bandwidth)
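The table translates directly into small cost functions, which can be handy when estimating the running time of an algorithm. In the sketch below, taking the logarithm in base 2 is an assumption (natural for a hypercube), and the function names are made up for this example.

```python
import math

def one_to_all_broadcast(ts, tw, m, p):
    """Also the cost of an all-to-one reduction and of an all-reduce."""
    log_p = math.log2(p)
    return min((ts + tw * m) * log_p, 2 * (ts * log_p + tw * m))

def all_to_all_broadcast(ts, tw, m, p):
    """Also the cost of an all-to-all reduction, a scatter, and a gather."""
    return ts * math.log2(p) + tw * m * (p - 1)

def all_to_all_personalized(ts, tw, m, p):
    """Every process sends a distinct message of size m to every other process."""
    return (ts + tw * m) * (p - 1)

def circular_shift(ts, tw, m):
    """One message of size m to a neighbor."""
    return ts + tw * m

# Example: long messages favor the second form of the broadcast cost.
print(one_to_all_broadcast(1.0, 1.0, 1, 8), one_to_all_broadcast(1.0, 1.0, 1000, 8))
```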
MATRIX-MATRIX PRODUCTS
Algorithm in pseudo-code:

for i=0:n-1 do
    for j=0:n-1 do
        C(i,j) = 0;
        for k=0:n-1 do
            C(i,j) += A(i,k) * B(k,j);
        end
    end
end
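For reference, a runnable Python version of the same triple loop might look as follows. Storing the matrices as lists of rows and the name `matmul` are choices made for this sketch.

```python
def matmul(A, B):
    """Triple-loop product of two n x n matrices stored as lists of rows."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The innermost statement runs $n^3$ times, which is the amount of work the parallel algorithms below try to distribute.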
NAÏVE BLOCK OPERATIONS
• Algorithm proceeds by doing block operations.
• For $p$ processes, we create $p$ blocks of size $n/\sqrt{p}$.
• Simple approach:
  • all-to-all broadcast in each row of A
  • all-to-all broadcast in each column of B
  • Perform calculation with local data on each process
• Iso-efficiency: $p = O(n^2)$
[Figure: block partitioning of Matrix A and Matrix B]
CANNON’S ALGORITHM
● There are two issues with this simple algorithm:
  ▪ We should be able to increase $p$ closer to $n^3$ at iso-efficiency.
  ▪ This algorithm requires a lot of memory since a process needs to store an entire block row of A and block column of B.
● Cannon’s algorithm allows reducing the memory footprint.
● It works by cleverly shuffling the blocks of A and B such that each process never stores more than one block of A and B.
● The blocks of A are rotated inside each row while the blocks of B are rotated inside each column.
● The trick is to start with the right alignment.
● We will see later that the Dekel-Nassimi-Sahni (DNS) algorithm allows improving the scalability significantly.
COMMUNICATION STEPS
Initial distribution:
A00 B00 | A01 B01 | A02 B02 | A03 B03
A10 B10 | A11 B11 | A12 B12 | A13 B13
A20 B20 | A21 B21 | A22 B22 | A23 B23
A30 B30 | A31 B31 | A32 B32 | A33 B33

After the initial alignment:
A00 B00 | A01 B11 | A02 B22 | A03 B33
A11 B10 | A12 B21 | A13 B32 | A10 B03
A22 B20 | A23 B31 | A20 B02 | A21 B13
A33 B30 | A30 B01 | A31 B12 | A32 B23

● After the first communication step, each process has data to perform its first block multiplication.
● A key point is that, from then on, only a simple communication with neighbors is required at each step.
FIRST SHIFT
● B blocks are shifted up while A blocks are shifted left.
● Each process now has the next two blocks required for the product.
● Similar second and third shifts are required to complete the calculation: shift B up and A left.

Before the first shift:
A00 B00 | A01 B11 | A02 B22 | A03 B33
A11 B10 | A12 B21 | A13 B32 | A10 B03
A22 B20 | A23 B31 | A20 B02 | A21 B13
A33 B30 | A30 B01 | A31 B12 | A32 B23

After the first shift:
A01 B10 | A02 B21 | A03 B32 | A00 B03
A12 B20 | A13 B31 | A10 B02 | A11 B13
A23 B30 | A20 B01 | A21 B12 | A22 B23
A30 B00 | A31 B11 | A32 B22 | A33 B33
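The alignment and the shifts can be checked with a serial simulation of a $q \times q$ process grid. This is only a sketch under stated assumptions: $q$ divides $n$, blocks are nested lists, the helper names are hypothetical, and the shifts are list re-indexings rather than MPI messages.

```python
def block_matmul_add(C, A, B):
    """C += A * B for square blocks stored as lists of rows."""
    bs = len(A)
    for i in range(bs):
        for j in range(bs):
            for k in range(bs):
                C[i][j] += A[i][k] * B[k][j]

def split_blocks(M, q):
    """Cut an n x n matrix into a q x q grid of (n/q) x (n/q) blocks."""
    bs = len(M) // q
    return [[[row[j * bs:(j + 1) * bs] for row in M[i * bs:(i + 1) * bs]]
             for j in range(q)] for i in range(q)]

def cannon(A, B, q):
    """Simulate Cannon's algorithm on a q x q process grid (q must divide n)."""
    n = len(A)
    assert n % q == 0
    bs = n // q
    Ab, Bb = split_blocks(A, q), split_blocks(B, q)
    # Initial alignment: row i of A shifts left by i, column j of B shifts up by j.
    a_cur = [[Ab[i][(i + j) % q] for j in range(q)] for i in range(q)]
    b_cur = [[Bb[(i + j) % q][j] for j in range(q)] for i in range(q)]
    Cb = [[[[0.0] * bs for _ in range(bs)] for _ in range(q)] for _ in range(q)]
    for step in range(q):
        # Each process multiplies the single A block and B block it currently holds.
        for i in range(q):
            for j in range(q):
                block_matmul_add(Cb[i][j], a_cur[i][j], b_cur[i][j])
        if step < q - 1:
            # Shift A left by one and B up by one (neighbor communication only).
            a_cur = [[a_cur[i][(j + 1) % q] for j in range(q)] for i in range(q)]
            b_cur = [[b_cur[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    # Reassemble the blocks of C into a full matrix.
    return [[Cb[i // bs][j // bs][i % bs][j % bs] for j in range(n)]
            for i in range(n)]
```

With `q = 4` the loop performs four block multiplications and three shifts per process, matching the first, second, and third shifts on the slide.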
SCALABILITY: INCREASING THE NUMBER OF PROCESSES
Dimension of matrix: 1536, block size: 1536, number of procs along both dims: 1 1
The calculation took 16.068481 seconds; p x runtime = 16.068481
Dimension of matrix: 1536, block size: 768, number of procs along both dims: 2 2
The calculation took 4.514676 seconds; p x runtime = 18.058703
Dimension of matrix: 1536, block size: 512, number of procs along both dims: 3 3
The calculation took 2.262061 seconds; p x runtime = 20.358550
Dimension of matrix: 1536, block size: 384, number of procs along both dims: 4 4
The calculation took 2.193327 seconds; p x runtime = 35.093235
Dimension of matrix: 1536, block size: 256, number of procs along both dims: 6 6
The calculation took 0.472822 seconds; p x runtime = 17.021599
Dimension of matrix: 1536, block size: 192, number of procs along both dims: 8 8
The calculation took 0.174466 seconds; p x runtime = 11.165817
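The speed-up and efficiency implied by these timings can be computed with a few lines. The sketch below copies the measured run times from the output above and uses the single-process run of the same code as the reference, which is an assumption since it may not be the optimal serial time. The function name is hypothetical.

```python
# (processes per dimension, measured run time in seconds), copied from the output above
runs = [(1, 16.068481), (2, 4.514676), (3, 2.262061),
        (4, 2.193327), (6, 0.472822), (8, 0.174466)]

def speedup_and_efficiency(runs):
    """Return (p, speed-up, efficiency) using the single-process run as reference."""
    t1 = runs[0][1]
    rows = []
    for q, tp in runs:
        p = q * q
        s = t1 / tp
        rows.append((p, s, s / p))
    return rows

for p, s, e in speedup_and_efficiency(runs):
    print(f"p = {p:3d}  speed-up = {s:7.2f}  efficiency = {e:5.2f}")
```

With these numbers the 8 x 8 run comes out with an efficiency above 1 relative to the single-process run. Cache effects are one possible explanation, although the output alone does not confirm the cause.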
SCALABILITY: FIXED NUMBER OF PROCESSES
Dimension of matrix: 1536, block size: 512
Number of procs along both dims: 3 3
1 node The calculation took 2.090732 seconds
p x runtime = 18.816589
2 nodes The calculation took 1.783496 seconds
p x runtime = 16.051465
3 nodes The calculation took 1.705040 seconds
p x runtime = 15.345360
5 nodes The calculation took 1.671645 seconds
p x runtime = 15.044806
9 nodes The calculation took 1.651060 seconds
p x runtime = 14.859539
CANNON’S ALGORITHM
• MPI code: mmm/
• With this algorithm, processes store only 2 blocks at a time.
• Cost of communication is slightly different from naïve algorithm but in the end the running times are comparable.
• The iso-efficiency curve is very close to the other algorithm with $p$ scaling as $n^2$.
• Cannon’s main feature is the reduced memory footprint. This is important as MPI codes often need a lot of memory.
DEKEL-NASSIMI-SAHNI ALGORITHM
● The number of operations is $O(n^3)$. Therefore we should be able to use up to about $p = O(n^3)$ processes. The DNS algorithm achieves this.
● In this algorithm, each process $P_{ijk}$ calculates one product A(i,k) * B(k,j), where these are matrix blocks.
INITIAL DISTRIBUTION OF DATA
● We use $p = n^3$ processes.
● We want to compute C(i,j) += A(i,k) * B(k,j)
● Process (i,j,k) will do this multiplication.
● Then, a reduction over k is used to get the final C(i,j).
● Initially, the processes (i,j,0) hold the data for A and B.
STEP 1
● Let’s focus on A. B is similar.
● We need to propagate the blocks of A such that process (i,j,k) has A(i,k).
● So process (i,j,0) (has A(i,j)) needs to send its block to all processes (i,*,j).
● The first communication is a Send from (i,j,0) to (i,j,j).
● Do this for all i and j.
[Figure: process grid with axes i, j, k]
STEP 2
● We now do a broadcast inside each column.
● (i,j,j) broadcasts to all (i,*,j).
● A broadcast with a communicator is used for this.
[Figure: process grid with axes i, j, k]
STEP 3
● At last, we can compute something!
● Process (i,j,k) has A(i,k) and B(k,j).
STEP 4
● We finally do a reduction inside each vertical column (index k).
● Process (i,j,0) has C(i,j) in the end.
[Figure: reduction along k]
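Steps 1 to 4 can be traced in a serial simulation with one matrix entry per block, so that there are $n^3$ simulated processes indexed by (i, j, k). The movement of B, sent from (i,j,0) to (i,j,i) and then broadcast to all (*,j,i), is an assumption made by symmetry with A since the slides only state that B is similar. The name `dns_matmul` is hypothetical.

```python
def dns_matmul(A, B):
    """Simulate DNS with one entry per block on n^3 simulated processes."""
    n = len(A)
    # Step 1: (i, j, 0) sends A(i, j) to (i, j, j) and B(i, j) to (i, j, i).
    a = {(i, j, j): A[i][j] for i in range(n) for j in range(n)}
    b = {(i, j, i): B[i][j] for i in range(n) for j in range(n)}
    # Step 2: (i, j, j) broadcasts its A entry to all (i, *, j);
    #         (i, j, i) broadcasts its B entry to all (*, j, i).
    for i in range(n):
        for j in range(n):
            for t in range(n):
                a[(i, t, j)] = a[(i, j, j)]
                b[(t, j, i)] = b[(i, j, i)]
    # Step 3: process (i, j, k) now holds A(i, k) and B(k, j) and multiplies them.
    prod = {(i, j, k): a[(i, j, k)] * b[(i, j, k)]
            for i in range(n) for j in range(n) for k in range(n)}
    # Step 4: reduction along k; process (i, j, 0) ends up with C(i, j).
    return [[sum(prod[(i, j, k)] for k in range(n)) for j in range(n)]
            for i in range(n)]

print(dns_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every simulated process performs exactly one multiplication. All remaining work is communication, which is why the cost of the broadcasts and of the reduction decides the scalability.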
WHY IS DNS A BETTER ALGORITHM?
• The iso-efficiency is $n^3 = \Theta(p\,(\ln p)^3)$
• In Cannon, $p = O(n^2)$, whereas in DNS, $p = O(n^3/(\ln n)^3)$.
• What does it mean that the iso-efficiency curve of DNS is better than Cannon?
• Because DNS is able to use larger blocks, fewer communications are required.
• The algorithm is therefore more efficient (has better scalability).
• Communication = size$^2$. Computation = size$^3$. Large blocks are favorable.
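To get a feel for the two scalings, the sketch below evaluates $n^2$ (Cannon) and $n^3/(\ln n)^3$ (DNS) for a few matrix sizes. All constant factors are dropped, so the numbers only indicate orders of magnitude. The function names are assumptions made for illustration.

```python
import math

def p_cannon(n):
    """Order of magnitude of p at iso-efficiency for Cannon: n^2."""
    return n ** 2

def p_dns(n):
    """Order of magnitude of p at iso-efficiency for DNS: n^3 / (ln n)^3."""
    return n ** 3 / math.log(n) ** 3

for n in (128, 1024, 8192):
    print(n, p_cannon(n), round(p_dns(n)), round(p_dns(n) / p_cannon(n), 2))
```

The last column is the ratio $n/(\ln n)^3$, which grows with $n$. For large matrices, DNS can therefore keep more processes busy at the same efficiency.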