lecture 21 - stanford universityweb.stanford.edu/class/cme213/files/lectures/lecture_21.pdf ·...

CME 213 Eric Darve SPRING 2017


Page 1: Lecture 21 - Stanford Universityweb.stanford.edu/class/cme213/files/lectures/Lecture_21.pdf · 2017. 6. 9. · 21 ISO-EFFICIENCYPLOTS We plot for the two algorithms the value of p

CME 213

Eric Darve

SPRING 2017


MAIN DEBUGGING TECHNIQUES

A whole class would be needed on this topic!

1. Use assert: test conditions on variables: equality, inequality, magnitude
2. Print out, but asserts are fundamentally better
3. Talk to your duck
4. Use theoretical results, e.g., convergence rate
5. Test against reference code
6. Manufactured solution: compare against known solutions
7. Incremental changes: test after each change (regression testing)
8. Unit/module tests
9. Test inputs of increasing difficulty; code coverage
10. Simplify problem to minimal buggy example
11. Dichotomy, divide-and-conquer
12. Ask Piazza, and your TAs/instructor.
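Items 1 and 4 can be combined into a single automated check. The sketch below is only an illustration: it assumes the code under test is a composite trapezoid rule (the names `trapezoid` and `observed_order` are hypothetical, not from the course code) and asserts that halving the step size reduces the error at the theoretical second-order rate.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal sub-intervals (stand-in for the code under test)."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def observed_order(f, a, b, exact, n):
    """Estimate the convergence order from the errors at n and 2n sub-intervals."""
    err_coarse = abs(trapezoid(f, a, b, n) - exact)
    err_fine = abs(trapezoid(f, a, b, 2 * n) - exact)
    return math.log2(err_coarse / err_fine)


# The integral of sin over [0, pi] is exactly 2; theory predicts order 2.
order = observed_order(math.sin, 0.0, math.pi, 2.0, 64)
assert abs(order - 2.0) < 0.1, f"unexpected convergence order {order}"
```

An assert like this fails loudly as soon as a change breaks the expected rate, which a print statement cannot do.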


PERFORMANCE METRICS


WHY PERFORMANCE METRICS?

Understanding the performance of a code is important:
● to develop efficient code
● understand the bottlenecks of a code
● compare algorithms in a meaningful way, e.g., matrix-vector products using different partitioning schemes for the matrix.

The total runtime can be broken down, generally speaking, into the following categories:
● Local computations
● Data exchange and communication
● Idle time (load imbalance, spurious synchronization)


THE BASIC CONCEPT: SPEED-UP

● This quantity measures how much faster the code runs because we are using many processes.

● Define $T_1(n)$: the optimal (reference) running time with a single process.

● Define $T_p(n)$: the running time with p processes.

● The speed-up is then the ratio: $S_p(n) = T_1(n) / T_p(n)$

● We expect this number to grow like p as we keep increasing the number of processes.


AMDAHL'S LAW

● Although this is a crude model, it can provide a general sense of what speed-up can be achieved. This is a good measure to understand how much of the code one should try to parallelize.

● Assume that a fraction f of the code is executed sequentially. Then the speed-up with p processes is given by: $S_p = \dfrac{1}{f + (1 - f)/p}$

● This means that the speed-up has an upper bound, $1/f$. The efficiency must go to 0 eventually. In most cases, this statement is true as p increases.

● However we are saved by another result: f often decreases with n, that is the fraction of parallel work increases with n.

● This is why we can still benefit from Peta and Exascale machines with 100,000+ cores.
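A short numerical illustration of Amdahl's law, 1 / (f + (1 - f) / p) for a sequential fraction f and p processes. The value f = 0.05 is an arbitrary assumption chosen for the example, and `amdahl_speedup` is a hypothetical helper name.

```python
def amdahl_speedup(f, p):
    """Speed-up predicted by Amdahl's law for sequential fraction f and p processes."""
    return 1.0 / (f + (1.0 - f) / p)


f = 0.05  # assumed sequential fraction, for illustration only
for p in (1, 10, 100, 1000, 100000):
    s = amdahl_speedup(f, p)
    # The speed-up saturates near 1/f while the efficiency s/p decays towards 0.
    print(f"p = {p:6d}   speed-up = {s:6.2f}   efficiency = {s / p:.4f}")
```

With a fixed f the speed-up never exceeds 1/f, however large p becomes; only a smaller f (typically obtained with a larger problem size n) raises that ceiling.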


PROFILING AND CUDA COMPUTING

● A similar situation arises with CUDA.
1. Profile your code. Kernel 1: 90%
2. Port kernel 1 to GPU. Kernel 1: 10%. Kernel 2 becomes 80% now. Further optimizing kernel 1 will no longer yield any benefit.

● Remember to iterate between profiling and optimizing your code.

● Always focus on the longest kernel as given by your profiler.


A MORE APPROPRIATE CONCEPT: EFFICIENCY

● The previous definition has a problem. As p increases, we want to know whether the speed-up scales as p or not.

● This might be difficult to assess from a plot. Ideally, the speed-up is a straight line.

● It is therefore more convenient to look at the efficiency, the speed-up $S_p$ divided by the number of processes: $E_p = S_p / p$

● Ideally that quantity is simply a constant as p increases. That is easier to read from a plot.

● The maximum value for efficiency is 1 (except in some rare circumstances because of cache effects).


EFFICIENCY PLOTS

Typical behavior of the efficiency as the number of processes increases or as the problem size (amount of computational work to perform) increases.


EXAMPLE 1: DOT PRODUCT

Two-step algorithm:

1. Calculate local dot product:

2. Use a spanning tree for the final reduction: $\log_2 p$ passes are required.
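The two steps can be sketched as a serial simulation. This is not MPI code: the p "processes" are simply list entries, and `parallel_dot` is a hypothetical name. It computes one partial dot product per process, then combines the partial results pairwise along a binary tree while counting the passes.

```python
def parallel_dot(x, y, p):
    """Simulate a p-process dot product; returns (result, number of reduction passes)."""
    n = len(x)
    # Step 1: each simulated process computes the dot product of its contiguous chunk.
    bounds = [(r * n) // p for r in range(p + 1)]
    partial = [
        sum(a * b for a, b in zip(x[lo:hi], y[lo:hi]))
        for lo, hi in zip(bounds[:-1], bounds[1:])
    ]
    # Step 2: tree reduction. In each pass, rank r receives from rank r + stride.
    passes = 0
    stride = 1
    while stride < p:
        for r in range(0, p, 2 * stride):
            if r + stride < p:
                partial[r] += partial[r + stride]
        stride *= 2
        passes += 1
    return partial[0], passes


x = list(range(1, 17))
y = [2] * 16
result, passes = parallel_dot(x, y, 8)
print(result, passes)
```

For p a power of two the loop runs exactly log2(p) times; for other values of p it runs ceil(log2(p)) times.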


EXAMPLE 1: DOT PRODUCT

Total runtime with one process:

Total runtime in parallel:


EFFICIENCY

• Efficiency:

• The efficiency can be maintained provided that we do not scale p faster than:

• This means that p cannot increase as fast as n.

• Iso-efficiency means that p increases at a rate such that the efficiency remains constant.

• This is very important. Good algorithms are such that p can be increased rapidly at iso-efficiency.
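The iso-efficiency condition for the dot product can be made concrete with a simple cost model. The sketch below rests on assumptions that are not recoverable from the slide text: a time $t_c$ per multiply-add, and a cost $t_s + t_w$ (latency plus one word transferred) for each of the $\log_2 p$ reduction passes.

```latex
\begin{align*}
T_1(n) &= t_c\, n \\
T_p(n) &= t_c\, \frac{n}{p} + (t_s + t_w) \log_2 p \\
E_p(n) &= \frac{T_1(n)}{p\, T_p(n)}
        = \frac{1}{1 + \dfrac{(t_s + t_w)\, p \log_2 p}{t_c\, n}} \\
E_p(n) = \text{const}
  &\iff p \log_2 p \propto n
  \implies p = O\!\left(\frac{n}{\log n}\right)
\end{align*}
```

Under this model p may grow almost linearly with n, but not quite: the logarithmic cost of the tree reduction is what prevents p from scaling as fast as n.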



EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 1D PARTITIONING

• In that case, we can model the serial running time as:

• Parallel running time (comp + comm):

• Efficiency:

• Iso-efficiency requires:

• This is actually not very good. We expect that p should increase like $n^2$ roughly, because this is the amount of work to do.
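A possible reconstruction of the 1D analysis, under an assumed cost model: $t_c$ per multiply-add, the matrix distributed by block rows, and the vector assembled on every process with an all-to-all broadcast of messages of size $n/p$, whose hypercube cost is $t_s \log p + t_w (n/p)(p-1) \approx t_s \log_2 p + t_w n$. The constants are illustrative assumptions, not values from the lecture.

```latex
\begin{align*}
T_1(n) &= t_c\, n^2 \\
T_p(n) &= t_c\, \frac{n^2}{p} + t_s \log_2 p + t_w\, n \\
E_p(n) &= \frac{1}{1 + \dfrac{t_s\, p \log_2 p + t_w\, n\, p}{t_c\, n^2}} \\
E_p(n) = \text{const}
  &\implies \frac{t_w\, p}{t_c\, n} = O(1)
  \implies p = O(n)
\end{align*}
```

The bandwidth term $t_w n$ does not shrink as p grows, so p can only scale linearly in n even though the work grows like $n^2$.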


EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 2D PARTITIONING

• Recall that we previously said that 2D partitioning is better. Let's see if theory supports our claim.

• Computation:

• Send b to diagonal processes:

• Broadcast b in each column:

• Reduction across columns (same running time as broadcast because these operations are dual of one another):


ISO-EFFICIENCY

• Withthepreviousresults:

• Iso-efficiency:

• This is much better than with the 1D partitioning.
• We can increase p more rapidly at iso-efficiency.
• Another interpretation is that for a given number of processes, this scheme is faster. Practically, it has less communication.
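A possible reconstruction of the 2D analysis, again under an assumed cost model ($t_c$ per multiply-add, a $\sqrt{p} \times \sqrt{p}$ process grid, vector pieces of size $n/\sqrt{p}$, and broadcast or reduction among $\sqrt{p}$ processes costing $(t_s + t_w m) \log_2 \sqrt{p}$ for a message of size m). Lower-order terms are dropped.

```latex
\begin{align*}
\text{Computation:} \quad & t_c\, \frac{n^2}{p} \\
\text{Send to diagonal:} \quad & t_s + t_w\, \frac{n}{\sqrt{p}} \\
\text{Column broadcast:} \quad & \Big(t_s + t_w\, \frac{n}{\sqrt{p}}\Big) \log_2 \sqrt{p} \\
\text{Row reduction:} \quad & \Big(t_s + t_w\, \frac{n}{\sqrt{p}}\Big) \log_2 \sqrt{p} \\
T_p(n) &\approx t_c\, \frac{n^2}{p} + t_s \log_2 p + t_w\, \frac{n}{\sqrt{p}} \log_2 p \\
E_p(n) &\approx \frac{1}{1 + \dfrac{t_s\, p \log_2 p + t_w\, n \sqrt{p}\, \log_2 p}{t_c\, n^2}} \\
E_p(n) = \text{const}
  &\implies \sqrt{p}\, \log_2 p \propto n
  \implies p = O\!\left(\frac{n^2}{\log^2 n}\right)
\end{align*}
```

Compared with a bound of $p = O(n)$, this lets p grow almost like $n^2$, which is the amount of work.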


ISO-EFFICIENCY PLOTS

● We plot for the two algorithms the value of p as a function of n such that iso-efficiency is maintained.

● Larger values of p are better.
● This means improved scalability.

Larger p at iso-efficiency = algorithm is more scalable

Red = higher efficiency than blue = runs faster
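Iso-efficiency curves of this kind can be tabulated from a cost model. The sketch below is only an illustration: the two model formulas (1D: n²/p + ts log p + tw n; 2D: n²/p + ts log p + tw (n/√p) log p) and the constants `tc`, `ts`, `tw` are assumptions, not values from the lecture, and the function names are hypothetical. For each n it searches for the largest p (among powers of 4) that keeps the efficiency above a target.

```python
import math


def efficiency_1d(n, p, tc=1.0, ts=50.0, tw=4.0):
    """Model efficiency of the matrix-vector product with 1D partitioning."""
    serial = tc * n * n
    parallel = tc * n * n / p + ts * math.log2(p) + tw * n
    return serial / (p * parallel)


def efficiency_2d(n, p, tc=1.0, ts=50.0, tw=4.0):
    """Model efficiency of the matrix-vector product with 2D partitioning."""
    serial = tc * n * n
    parallel = (tc * n * n / p + ts * math.log2(p)
                + tw * (n / math.sqrt(p)) * math.log2(p))
    return serial / (p * parallel)


def largest_p(efficiency, n, target=0.8):
    """Largest power of 4 (up to n*n) whose model efficiency is at least target."""
    best, p = 1, 1
    while p <= n * n:
        if efficiency(n, p) >= target:
            best = p
        p *= 4
    return best


for n in (256, 1024, 4096, 16384):
    print(n, largest_p(efficiency_1d, n), largest_p(efficiency_2d, n))
```

In this model the 2D column grows faster with n than the 1D column, which is the sense in which the 2D scheme is more scalable.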


SUMMARY OF COMMUNICATION TIMES

Operation                                     Hypercube time
One-to-all broadcast, all-to-one reduction    min((ts + tw m) log p, 2(ts log p + tw m))
All-to-all broadcast, all-to-all reduction    ts log p + tw m (p-1)
All-reduce                                    min((ts + tw m) log p, 2(ts log p + tw m))
Scatter, gather                               ts log p + tw m (p-1)
All-to-all personalized                       (ts + tw m) (p-1)
Circular shift                                ts + tw m

● m: size of message
● p: number of processes
● ts: latency (start-up time per message)
● tw: transfer time per word (inverse of the bandwidth)


MATRIX-MATRIX PRODUCTS


MATRIX-MATRIX PRODUCTS

Algorithm in pseudo-code:

for i=0:n-1 do
    for j=0:n-1 do
        C(i,j) = 0;
        for k=0:n-1
            C(i,j) += A(i,k) * B(k,j);
        end
    end
end


NAÏVE BLOCK OPERATIONS

• Algorithm proceeds by doing block operations.

• For p processes, we create p blocks of size $n/p^{1/2}$.
• Simple approach:
  • all-to-all broadcast in each row of A
  • all-to-all broadcast in each column of B
  • Perform calculation with local data on each process

• Iso-efficiency:

[Figure: Matrix A, Matrix B]


CANNON’S ALGORITHM

● There are two issues with this simple algorithm:
  ▪ We should be able to increase p closer to $n^3$ at iso-efficiency.
  ▪ This algorithm requires a lot of memory since a process needs to store an entire block row of A and block column of B.

● Cannon's algorithm allows reducing the memory footprint.
● It works by cleverly shuffling the blocks of A and B such that each process never stores more than one block of A and B.
● The blocks of A are rotated inside each row while the blocks of B are rotated inside each column.
● The trick is to start with the right alignment.
● We will see later that the Dekel-Nassimi-Sahni (DNS) algorithm allows improving the scalability significantly.


COMMUNICATION STEPS

Initial distribution of the blocks:

A00 B00   A01 B01   A02 B02   A03 B03
A10 B10   A11 B11   A12 B12   A13 B13
A20 B20   A21 B21   A22 B22   A23 B23
A30 B30   A31 B31   A32 B32   A33 B33

After the first communication step:

A00 B00   A01 B11   A02 B22   A03 B33
A11 B10   A12 B21   A13 B32   A10 B03
A22 B20   A23 B31   A20 B02   A21 B13
A33 B30   A30 B01   A31 B12   A32 B23

● After the first communication step, each process has data to perform its first block multiplication.

● A key point is that, from then on, only a simple communication with neighbors is required at each step.


FIRST SHIFT

● B blocks are shifted up while A blocks are shifted left.
● Each process has now the next two blocks required for the product.
● Similar second and third shifts are required to complete the calculation: shift B up and A left.

Before the shift:

A00 B00   A01 B11   A02 B22   A03 B33
A11 B10   A12 B21   A13 B32   A10 B03
A22 B20   A23 B31   A20 B02   A21 B13
A33 B30   A30 B01   A31 B12   A32 B23

After the shift:

A01 B10   A02 B21   A03 B32   A00 B03
A12 B20   A13 B31   A10 B02   A11 B13
A23 B30   A20 B01   A21 B12   A22 B23
A30 B00   A31 B11   A32 B22   A33 B33
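The alignment and shift pattern can be checked with a small serial simulation. This is a sketch, not the course's MPI code: processes are entries of a q x q Python list, blocks are lists of rows, and the names `cannon`, `matmul` and `add_into` are hypothetical.

```python
def matmul(X, Y):
    """Dense product of two square blocks stored as lists of rows."""
    m = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]


def add_into(C, P):
    """Accumulate block P into block C in place."""
    for i, row in enumerate(P):
        for j, v in enumerate(row):
            C[i][j] += v


def cannon(A, B, q):
    """A[i][j], B[i][j]: blocks initially held by process (i, j) of a q x q grid."""
    m = len(A[0][0])
    # Alignment: row i of A is rotated left by i, column j of B is rotated up by j.
    a = [[A[i][(j + i) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[[[0] * m for _ in range(m)] for _ in range(q)] for _ in range(q)]
    for _ in range(q):
        # Every process multiplies the single pair of blocks it currently holds.
        for i in range(q):
            for j in range(q):
                add_into(C[i][j], matmul(a[i][j], b[i][j]))
        # Neighbor communication: shift A left by one, shift B up by one.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C
```

With q = 4, the lists `a` and `b` after the alignment and after the first shift reproduce the two block layouts of the 4 x 4 example, and each process only ever holds one block of A and one block of B.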


SCALABILITY: INCREASING THE NUMBER OF PROCESSES

Dimension of matrix: 1536, block size: 1536, number of procs along both dims: 1 1
The calculation took 16.068481 seconds; p x runtime = 16.068481
Dimension of matrix: 1536, block size: 768, number of procs along both dims: 2 2
The calculation took 4.514676 seconds; p x runtime = 18.058703
Dimension of matrix: 1536, block size: 512, number of procs along both dims: 3 3
The calculation took 2.262061 seconds; p x runtime = 20.358550
Dimension of matrix: 1536, block size: 384, number of procs along both dims: 4 4
The calculation took 2.193327 seconds; p x runtime = 35.093235
Dimension of matrix: 1536, block size: 256, number of procs along both dims: 6 6
The calculation took 0.472822 seconds; p x runtime = 17.021599
Dimension of matrix: 1536, block size: 192, number of procs along both dims: 8 8
The calculation took 0.174466 seconds; p x runtime = 11.165817
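The timings above can be converted into speed-up and efficiency. The sketch below takes the single-process run as the reference time, which is an assumption (it may not be the optimal serial code), and `scaling_table` is a hypothetical helper name.

```python
# (processes along each dimension, measured runtime in seconds), from the runs above
runs = [
    (1, 16.068481),
    (2, 4.514676),
    (3, 2.262061),
    (4, 2.193327),
    (6, 0.472822),
    (8, 0.174466),
]


def scaling_table(runs):
    """Return (p, speed-up, efficiency) rows, using the first run as the reference."""
    t_ref = runs[0][1]
    rows = []
    for q, t in runs:
        p = q * q
        speedup = t_ref / t
        rows.append((p, speedup, speedup / p))
    return rows


for p, s, e in scaling_table(runs):
    print(f"p = {p:2d}   speed-up = {s:6.2f}   efficiency = {e:.3f}")
```

Note that efficiency is the reference time divided by "p x runtime", so the 4 x 4 run stands out as an outlier, while the 8 x 8 run comes out above 1, which may reflect the cache effects mentioned earlier in the lecture.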


SCALABILITY: FIXED NUMBER OF PROCESSES

Dimension of matrix: 1536, block size: 512

Number of procs along both dims: 3 3

1 node The calculation took 2.090732 seconds

p x runtime = 18.816589

2 nodes The calculation took 1.783496 seconds

p x runtime = 16.051465

3 nodes The calculation took 1.705040 seconds

p x runtime = 15.345360

5 nodes The calculation took 1.671645 seconds

p x runtime = 15.044806

9 nodes The calculation took 1.651060 seconds

p x runtime = 14.859539


CANNON’S ALGORITHM

• MPI code: mmm/

• With this algorithm, processes store only 2 blocks at a time.

• Cost of communication is slightly different from naïve algorithm but in the end the running times are comparable.

• The iso-efficiency curve is very close to the other algorithm with p scaling as $n^2$.

• Cannon's main feature is the reduced memory footprint. This is important as MPI codes often need a lot of memory.


DEKEL-NASSIMI-SAHNI ALGORITHM


DEKEL-NASSIMI-SAHNI ALGORITHM

● The number of operations is $O(n^3)$. Therefore we should be able to use up to about $p = O(n^3)$ processes. The DNS algorithm achieves this.

● In this algorithm, each process $P_{ijk}$ calculates one product A(i,k) * B(k,j) where these are matrix blocks.


INITIAL DISTRIBUTION OF DATA

● We use $p = n^3$ processes.

● We want to compute C(i,j) += A(i,k) * B(k,j)

● Process (i,j,k) will do this multiplication.

● Then, a reduction over k is used to get the final C(i,j).

● Initially, the processes (i,j,0) hold the data for A and B.


STEP 1

● Let's focus on A. B is similar.
● We need to propagate the blocks of A such that process (i,j,k) has A(i,k).

● So process (i,j,0) (has A(i,j)) needs to send its block to all processes (i,*,j).

● The first communication is a Send from (i,j,0) to (i,j,j).

● Do this for all i and j.

[Figure: 3D process grid with axes i, j, k]


STEP 2

● We now do a broadcast inside each column.

● (i,j,j) broadcasts to all (i,*,j).
● A broadcast with a communicator is used for this.

[Figure: 3D process grid with axes i, j, k]


STEP 3

● At last, we can compute something!
● Process (i,j,k) has A(i,k) and B(k,j).


STEP 4

● We finally do a reduction inside each vertical column (index k).
● Process (i,j,0) has C(i,j) in the end.

[Figure: reduction along k]
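The four steps can be traced with a serial simulation in which each block is a single matrix entry, so that p = n³. This is a sketch under stated assumptions: the name `dns_multiply` is hypothetical, and the data movement for B (send from (i,j,0) to (i,j,i), then broadcast to all (*,j,i)) is inferred by symmetry from the description given for A.

```python
def dns_multiply(A, B):
    """Simulate DNS on an n x n x n process grid with one scalar entry per process."""
    n = len(A)
    # a[i][j][k], b[i][j][k]: values held by simulated process (i, j, k).
    a = [[[None] * n for _ in range(n)] for _ in range(n)]
    b = [[[None] * n for _ in range(n)] for _ in range(n)]
    # Initial distribution: the plane k = 0 holds A(i,j) and B(i,j).
    for i in range(n):
        for j in range(n):
            a[i][j][0] = A[i][j]
            b[i][j][0] = B[i][j]
    # Step 1: point-to-point sends out of the plane k = 0.
    for i in range(n):
        for j in range(n):
            a[i][j][j] = a[i][j][0]  # A(i,j) goes from (i,j,0) to (i,j,j)
            b[i][j][i] = b[i][j][0]  # B(i,j) goes from (i,j,0) to (i,j,i) (assumed)
    # Step 2: broadcasts, so that process (i,j,k) ends up with A(i,k) and B(k,j).
    for i in range(n):
        for k in range(n):
            src = a[i][k][k]  # process (i,k,k) holds A(i,k)
            for j in range(n):
                a[i][j][k] = src
    for j in range(n):
        for k in range(n):
            src = b[k][j][k]  # process (k,j,k) holds B(k,j)
            for i in range(n):
                b[i][j][k] = src
    # Step 3: one local multiplication per process.
    prod = [[[a[i][j][k] * b[i][j][k] for k in range(n)]
             for j in range(n)] for i in range(n)]
    # Step 4: reduction along k; the result C(i,j) lands on process (i,j,0).
    return [[sum(prod[i][j]) for j in range(n)] for i in range(n)]
```

Every process performs exactly one multiplication; all remaining time is communication (two sends, two broadcasts, one reduction).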


WHY IS DNS A BETTER ALGORITHM?

• The iso-efficiency is

• In Cannon, $p = O(n^2)$, whereas in DNS, $p = O(n^3/(\ln n)^3)$.
• What does it mean that the iso-efficiency curve of DNS is better than Cannon?

• Because DNS is able to use larger blocks, fewer communications are required.

• The algorithm is therefore more efficient (has better scalability).

• Communication = size². Computation = size³. Large blocks are favorable.
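The last point can be written as a one-line estimate. The symbols below are assumptions introduced for illustration: $b$ is the side length of a block, $t_w$ the transfer time per word, and $t_c$ the time per multiply-add.

```latex
\begin{align*}
T_{\text{comm}} &\propto t_w\, b^2
  && \text{(a block has } b^2 \text{ entries to send)} \\
T_{\text{comp}} &\propto t_c\, b^3
  && \text{(a block product costs } b^3 \text{ multiply-adds)} \\
\frac{T_{\text{comm}}}{T_{\text{comp}}} &\propto \frac{t_w}{t_c}\, \frac{1}{b}
\end{align*}
```

The communication-to-computation ratio decays like $1/b$, so doubling the block size roughly halves the relative cost of communication.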