Lecture 21 - Stanford University
web.stanford.edu/class/cme213/files/lectures/lecture_21.pdf
TRANSCRIPT
CME213
Eric Darve
SPRING 2017
MAIN DEBUGGING TECHNIQUES
A whole class would be needed on this topic!
1. Use assert: test conditions on variables: equality, inequality, magnitude
2. Print out, but asserts are fundamentally better
3. Talk to your duck
4. Use theoretical results, e.g., convergence rate
5. Test against reference code
6. Manufactured solution: compare against known solutions
7. Incremental changes: test after each change (regression testing)
8. Unit/module tests
9. Test inputs of increasing difficulty; code coverage
10. Simplify problem to a minimal buggy example
11. Dichotomy, divide-and-conquer
12. Ask Piazza, and your TAs/instructor.
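Techniques 1 and 6 can be combined in a few lines. A minimal sketch (the integrator and its names are ours, not from the lecture): assert on inputs, then compare against a known exact answer.

```python
# Technique 1 (asserts) + technique 6 (manufactured solution):
# check a trapezoid-rule integrator against a known exact integral.
def trapezoid(f, a, b, n):
    assert n > 0 and b > a          # test conditions on the inputs
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

# Manufactured solution: the integral of x^2 on [0,1] is exactly 1/3.
approx = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
assert abs(approx - 1.0 / 3.0) < 1e-5   # compare against the known solution
```

If the assert fires after a code change, the regression is caught immediately (technique 7).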
PERFORMANCE METRICS

WHY PERFORMANCE METRICS?
Understanding the performance of a code is important:
● to develop efficient code
● to understand the bottlenecks of a code
● to compare algorithms in a meaningful way, e.g., matrix-vector products using different partitioning schemes for the matrix.
The total run time can be broken down, generally speaking, into the following categories:
● Local computations
● Data exchange and communication
● Idle time (load imbalance, spurious synchronization)
THE BASIC CONCEPT: SPEED-UP
● This quantity measures how much faster the code runs because we are using many processes.
● Define T(1): the optimal (reference) running time with a single process.
● Define T(p): the running time with p processes.
● The speed-up is then the ratio: S(p) = T(1) / T(p)
● We expect this number to go up as we keep increasing the number of processes.
AMDAHL'S LAW
● Although this is a crude model, it can provide a general sense of what speed-up can be achieved. This is a good measure to understand how much of the code one should try to parallelize.
● Assume that a fraction f of the code is executed sequentially. Then the speed-up is given by: S(p) = 1 / (f + (1 - f)/p)
● This means that the speed-up has an upper bound: S(p) ≤ 1/f. The efficiency must go to 0 eventually. In most cases, this statement is true as p increases.
● However, we are saved by another result: f often decreases with n; that is, the fraction of parallel work increases with n.
● This is why we can still benefit from Peta- and Exascale machines with 100,000+ cores.
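The saturation of the speed-up is easy to check numerically. A small sketch (the function name is ours):

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speed-up when a fraction f of the work is
    sequential and the remaining 1-f is split over p processes."""
    return 1.0 / (f + (1.0 - f) / p)

# With 5% sequential work the speed-up saturates well below p:
for p in [10, 100, 1000]:
    print(p, round(amdahl_speedup(0.05, p), 2))
# The upper bound is 1/f = 20, no matter how large p gets.
```

Even with only 5% of sequential work, 1000 processes give less than a 20x speed-up.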
PROFILING AND CUDA COMPUTING
● A similar situation arises with CUDA.
1. Profile your code. Kernel 1: 90%.
2. Port kernel 1 to the GPU. Kernel 1: 10%. Kernel 2 becomes 80% now. Further optimizing kernel 1 will no longer yield any benefit.
● Remember to iterate between profiling and optimizing your code.
● Always focus on the longest kernel as given by your profiler.
A MORE APPROPRIATE CONCEPT: EFFICIENCY
● The previous definition has a problem. As p increases, we want to know whether the speed-up scales as p or not.
● This might be difficult to assess from a plot. Ideally, the speed-up is a straight line.
● It is therefore more convenient to look at the efficiency: E(p) = S(p) / p = T(1) / (p T(p))
● Ideally that quantity is simply a constant as p increases. That is easier to read from a plot.
● The maximum value for the efficiency is 1 (except in some rare circumstances because of cache effects).
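As a quick illustration, the efficiency from the definition above, applied to made-up strong-scaling timings (the numbers are ours, not measurements from the lecture):

```python
def efficiency(t1, tp, p):
    """Parallel efficiency E(p) = T(1) / (p * T(p))."""
    return t1 / (p * tp)

# Hypothetical timings from a strong-scaling run:
t1 = 16.0
for p, tp in [(1, 16.0), (4, 4.4), (16, 1.4)]:
    print(p, round(efficiency(t1, tp, p), 2))
# The drop below 1.0 is immediately visible, while on a raw
# speed-up plot the same data would still look "almost linear".
```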
EFFICIENCY PLOTS
(Figure: typical behavior of the efficiency as the number of processes increases or as the problem size, i.e., the amount of computational work to perform, increases.)
EXAMPLE 1: DOT PRODUCT
Two-step algorithm:
1. Calculate the local dot product on each process.
2. Use a spanning tree for the final reduction: log2 p passes are required.
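The two steps can be simulated in a few lines of plain Python (no MPI; the list `partial` stands in for the per-process values, and the names are ours):

```python
def tree_dot(x, y, p):
    """Two-step parallel dot product: local products, then a
    pairwise (spanning-tree) reduction in log2(p) passes."""
    n = len(x)
    # Step 1: each "process" r computes its local dot product.
    partial = [sum(x[i] * y[i] for i in range(r * n // p, (r + 1) * n // p))
               for r in range(p)]
    # Step 2: combine pairs at doubling strides; log2(p) passes.
    passes, stride = 0, 1
    while stride < p:
        for r in range(0, p, 2 * stride):
            partial[r] += partial[r + stride]
        stride *= 2
        passes += 1
    return partial[0], passes

x = [1.0] * 8
y = [2.0] * 8
print(tree_dot(x, y, 4))  # (16.0, 2): the dot product, in log2(4) = 2 passes
```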
EXAMPLE 1: DOT PRODUCT
Total run time with one process: T(1) = 2n tc, where tc is the time per arithmetic operation.
Total run time in parallel: T(p) = 2(n/p) tc + (ts + tw) log2 p
EFFICIENCY
• Efficiency: E(p) = T(1) / (p T(p)) = 1 / (1 + p (ts + tw) log2 p / (2n tc)), where p (ts + tw) log2 p is the parallel communication overhead and 2n tc is the sequential work.
• The efficiency can be maintained provided that we do not scale p faster than n / log2 p.
• This means that p cannot increase as fast as n.
• Iso-efficiency means that p increases at a rate such that the efficiency remains constant.
• This is very important. Good algorithms are such that p can be increased rapidly at iso-efficiency.
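The iso-efficiency statement can be checked numerically: if n grows like p log2 p, the efficiency of the model stays flat. A sketch with made-up machine constants (tc, ts, tw are ours, chosen only for illustration):

```python
import math

# Efficiency of the parallel dot product under the model
# T(p) = 2(n/p) tc + (ts + tw) log2 p.
tc, ts, tw = 1e-9, 1e-6, 1e-8   # made-up machine constants

def dot_efficiency(n, p):
    t1 = 2 * n * tc
    tp = 2 * (n / p) * tc + (ts + tw) * math.log2(p)
    return t1 / (p * tp)

# Scaling n proportionally to p * log2(p) keeps the efficiency flat:
for p in [2, 8, 32, 128]:
    n = int(1e6 * p * math.log2(p))
    print(p, round(dot_efficiency(n, p), 3))
```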
EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 1D PARTITIONING
• In that case, we can model the serial running time as: T(1) = 2n² tc
• Parallel running time (comp + comm): T(p) = 2(n²/p) tc + ts log2 p + tw n
• Efficiency: E(p) = 1 / (1 + (p ts log2 p + p tw n) / (2n² tc))
• Iso-efficiency requires: n ∝ p, i.e., p = O(n)
• This is actually not very good. We expect that p should be able to increase like n², roughly, because this is the amount of work to do.
EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 2D PARTITIONING
• Recall that we previously said that 2D partitioning is better. Let's see if theory supports our claim.
• Computation: 2(n²/p) tc
• Send b to the diagonal processes: ts + tw n/√p
• Broadcast b in each column: (ts + tw n/√p) log2 √p
• Reduction across columns (same running time as the broadcast because these operations are dual of one another): (ts + tw n/√p) log2 √p
ISO-EFFICIENCY
• With the previous results: T(p) = 2(n²/p) tc + ts log2 p + tw (n/√p) log2 p, up to lower-order terms
• Iso-efficiency: √p log2 p ∝ n, i.e., p = O(n² / (log2 n)²)
• This is much better than with the 1D partitioning.
• We can increase p more rapidly at iso-efficiency.
• Another interpretation is that, for a given number of processes, this scheme is faster. Practically, it has less communication.
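Plugging both models into the efficiency formula makes the comparison concrete. A sketch, again with made-up constants tc, ts, tw:

```python
import math

# Efficiency of the 1D and 2D matrix-vector models from the slides
# (the machine constants are made up for illustration):
tc, ts, tw = 1e-9, 1e-6, 1e-8

def eff_1d(n, p):
    t1 = 2 * n * n * tc
    tp = 2 * n * n / p * tc + ts * math.log2(p) + tw * n
    return t1 / (p * tp)

def eff_2d(n, p):
    t1 = 2 * n * n * tc
    tp = (2 * n * n / p * tc + ts * math.log2(p)
          + tw * (n / math.sqrt(p)) * math.log2(p))
    return t1 / (p * tp)

n, p = 10_000, 1024
print(round(eff_1d(n, p), 3), round(eff_2d(n, p), 3))
# The 2D scheme wins: its communication term carries n/sqrt(p)
# instead of n.
```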
ISO-EFFICIENCY PLOTS
● We plot for the two algorithms the value of p as a function of n such that iso-efficiency is maintained.
● Larger values of p are better.
● This means improved scalability.
(Plot annotations: larger p at iso-efficiency = algorithm is more scalable; red = higher efficiency than blue = runs faster.)
SUMMARY OF COMMUNICATION TIMES

Operation | Hypercube time
One-to-all broadcast, All-to-one reduction | min((ts + tw m) log p, 2(ts log p + tw m))
All-to-all broadcast, All-to-all reduction | ts log p + tw m (p-1)
All-reduce | min((ts + tw m) log p, 2(ts log p + tw m))
Scatter, Gather | ts log p + tw m (p-1)
All-to-all personalized | (ts + tw m)(p-1)
Circular shift | ts + tw m

● m: size of the message
● p: number of processes
● ts: latency (start-up time per message)
● tw: per-word transfer time (inverse of the bandwidth)
MATRIX-MATRIX PRODUCTS

Algorithm in pseudo-code:

for i = 0:n-1 do
  for j = 0:n-1 do
    C(i,j) = 0;
    for k = 0:n-1 do
      C(i,j) += A(i,k) * B(k,j);
    end
  end
end
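The pseudo-code above, written out in Python for a small test case:

```python
def matmul(A, B, n):
    """Triple-loop matrix-matrix product, exactly as in the pseudo-code."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

The serial cost is the 2n³ operations of the three nested loops; the parallel algorithms that follow distribute exactly this work in blocks.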
NAÏVE BLOCK OPERATIONS
• The algorithm proceeds by doing block operations.
• For p processes, we create p blocks of size n/√p × n/√p.
• Simple approach:
  • all-to-all broadcast in each row of A
  • all-to-all broadcast in each column of B
  • perform the calculation with local data on each process
• Iso-efficiency: p = O(n²)
(Figure: block layouts of Matrix A and Matrix B.)
CANNON'S ALGORITHM
● There are two issues with this simple algorithm:
  ▪ We should be able to increase p closer to n³ at iso-efficiency.
  ▪ This algorithm requires a lot of memory since a process needs to store an entire block row of A and block column of B.
● Cannon's algorithm allows reducing the memory footprint.
● It works by cleverly shuffling the blocks of A and B such that each process never stores more than one block of A and B.
● The blocks of A are rotated inside each row while the blocks of B are rotated inside each column.
● The trick is to start with the right alignment.
● We will see later that the Dekel-Nassimi-Sahni (DNS) algorithm allows improving the scalability significantly.
COMMUNICATION STEPS
(Figure: a 4×4 process grid. Left: initial layout, where process (i,j) holds blocks A(i,j) and B(i,j). Right: after the initial alignment, process (i,j) holds A(i, (i+j) mod 4) and B((i+j) mod 4, j).)
● After the first communication step, each process has data to perform its first block multiplication.
● A key point is that, from then on, only a simple communication with neighbors is required at each step.
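The alignment and the rotations can be simulated without MPI by moving entries of two q×q arrays around; each array cell stands in for the single block a process holds (a sketch, names ours):

```python
def cannon(A, B, q):
    """Cannon's algorithm on a q x q 'process grid', one block of A
    and one block of B per process, simulated in plain Python."""
    # Initial alignment: shift row i of A left by i, column j of B up by j,
    # so process (i,j) holds A(i,(i+j)%q) and B((i+j)%q,j).
    a = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Local block multiplication on every process.
        for i in range(q):
            for j in range(q):
                C[i][j] += a[i][j] * b[i][j]
        # Neighbor communication only: rotate A left in each row,
        # B up in each column.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

# With 1x1 "blocks" this is just a 3x3 matrix product:
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(cannon(A, B, 3))
```

After q rotations every process has seen all q block pairs it needs, and C = A·B, while no process ever stored more than one block of A and one of B.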
FIRST SHIFT
● B blocks are shifted up while A blocks are shifted left.
● Each process now has the next two blocks required for the product.
● Similar second and third shifts are required to complete the calculation: shift B up and A left.
(Figure: left, the aligned layout, where process (i,j) holds A(i, (i+j) mod 4) and B((i+j) mod 4, j); right, after the first shift, process (i,j) holds A(i, (i+j+1) mod 4) and B((i+j+1) mod 4, j).)
SCALABILITY: INCREASING THE NUMBER OF PROCESSES

Dimension of matrix: 1536, block size: 1536, number of procs along both dims: 1 1
The calculation took 16.068481 seconds; p x runtime = 16.068481
Dimension of matrix: 1536, block size: 768, number of procs along both dims: 2 2
The calculation took 4.514676 seconds; p x runtime = 18.058703
Dimension of matrix: 1536, block size: 512, number of procs along both dims: 3 3
The calculation took 2.262061 seconds; p x runtime = 20.358550
Dimension of matrix: 1536, block size: 384, number of procs along both dims: 4 4
The calculation took 2.193327 seconds; p x runtime = 35.093235
Dimension of matrix: 1536, block size: 256, number of procs along both dims: 6 6
The calculation took 0.472822 seconds; p x runtime = 17.021599
Dimension of matrix: 1536, block size: 192, number of procs along both dims: 8 8
The calculation took 0.174466 seconds; p x runtime = 11.165817
SCALABILITY: FIXED NUMBER OF PROCESSES
Dimension of matrix: 1536, block size: 512
Number of procs along both dims: 3 3
1 node The calculation took 2.090732 seconds
p x runtime = 18.816589
2 nodes The calculation took 1.783496 seconds
p x runtime = 16.051465
3 nodes The calculation took 1.705040 seconds
p x runtime = 15.345360
5 nodes The calculation took 1.671645 seconds
p x runtime = 15.044806
9 nodes The calculation took 1.651060 seconds
p x runtime = 14.859539
CANNON'S ALGORITHM
• MPI code: mmm/
• With this algorithm, processes store only 2 blocks at a time.
• The cost of communication is slightly different from the naïve algorithm, but in the end the running times are comparable.
• The iso-efficiency curve is very close to the other algorithm's, with p scaling as n².
• Cannon's main feature is the reduced memory footprint. This is important as MPI codes often need a lot of memory.
DEKEL-NASSIMI-SAHNI ALGORITHM
● The number of operations is O(n³). Therefore we should be able to use up to about p = O(n³) processes. The DNS algorithm achieves this.
● In this algorithm, each process P(i,j,k) calculates one product A(i,k) * B(k,j), where these are matrix blocks.
INITIAL DISTRIBUTION OF DATA
● We use p = n³ processes.
● We want to compute C(i,j) += A(i,k) * B(k,j).
● Process (i,j,k) will do this multiplication.
● Then, a reduction over k is used to get the final C(i,j).
● Initially, the processes (i,j,0) hold the data for A and B.
STEP 1
● Let's focus on A. B is similar.
● We need to propagate the blocks of A such that process (i,j,k) has A(i,k).
● So process (i,j,0) (which has A(i,j)) needs to send its block to all processes (i,*,j).
● The first communication is a Send from (i,j,0) to (i,j,j).
● Do this for all i and j.
STEP 2
● We now do a broadcast inside each column.
● (i,j,j) broadcasts to all (i,*,j).
● A broadcast with a communicator is used for this.
STEP 3
● At last, we can compute something!
● Process (i,j,k) has A(i,k) and B(k,j).
STEP 4
● We finally do a reduction inside each vertical column (index k).
● Process (i,j,0) has C(i,j) in the end.
(Figure: reduction along k.)
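The four steps can be traced in plain Python by giving every "process" (i,j,k) its own dictionary entry (a sketch without MPI; names ours):

```python
def dns(A, B, n):
    """DNS data movement with p = n^3 simulated processes."""
    # Steps 1 + 2: after the send to (i,j,j) and the column broadcast,
    # process (i,j,k) holds A(i,k) and B(k,j).
    a = {(i, j, k): A[i][k] for i in range(n) for j in range(n) for k in range(n)}
    b = {(i, j, k): B[k][j] for i in range(n) for j in range(n) for k in range(n)}
    # Step 3: each process computes its single product A(i,k) * B(k,j).
    prod = {(i, j, k): a[i, j, k] * b[i, j, k]
            for i in range(n) for j in range(n) for k in range(n)}
    # Step 4: reduction along k; process (i,j,0) ends up with C(i,j).
    return [[sum(prod[i, j, k] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(dns(A, B, 2))  # [[19, 22], [43, 50]]
```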
WHY IS DNS A BETTER ALGORITHM?
• The iso-efficiency is: p = O(n³ / (log n)³)
• In Cannon, p = O(n²), whereas in DNS, p = O(n³/(log n)³).
• What does it mean that the iso-efficiency curve of DNS is better than Cannon's?
• Because DNS is able to use larger blocks, fewer communications are required.
• The algorithm is therefore more efficient (has better scalability).
• For a block of size s, communication scales like s² while computation scales like s³. Large blocks are favorable.