Lecture 21 - Stanford University
web.stanford.edu/class/cme213/files/lectures/lecture_21.pdf
TRANSCRIPT
CME213
Eric Darve
SPRING 2017
MAIN DEBUGGING TECHNIQUES
A whole class would be needed on this topic!
1. Use assert: test conditions on variables: equality, inequality, magnitude
2. Print out, but asserts are fundamentally better
3. Talk to your duck
4. Use theoretical results, e.g., convergence rate
5. Test against reference code
6. Manufactured solution: compare against known solutions
7. Incremental changes: test after each change (regression testing)
8. Unit/module tests
9. Test inputs of increasing difficulty; code coverage
10. Simplify problem to a minimal buggy example
11. Dichotomy, divide-and-conquer
12. Ask Piazza, and your TAs/instructor.
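Techniques 1 and 6 can be combined in a few lines. A minimal sketch (the integrator and its names are ours, not from the lecture): assert on inputs, then compare against a known exact answer.

```python
# Technique 1 (asserts) + technique 6 (manufactured solution):
# check a trapezoid-rule integrator against a known exact integral.
def trapezoid(f, a, b, n):
    assert n > 0 and b > a          # test conditions on the inputs
    h = (b - a) / n
    s = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return s * h

# Manufactured solution: the integral of x^2 on [0,1] is exactly 1/3.
approx = trapezoid(lambda x: x * x, 0.0, 1.0, 1000)
assert abs(approx - 1.0 / 3.0) < 1e-5   # compare against the known solution
```

If the assert fires after a code change, the regression is caught immediately (technique 7).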
PERFORMANCE METRICS

WHY PERFORMANCE METRICS?
Understanding the performance of a code is important:
● to develop efficient code
● to understand the bottlenecks of a code
● to compare algorithms in a meaningful way, e.g., matrix-vector products using different partitioning schemes for the matrix.
The total run time can be broken down, generally speaking, into the following categories:
● Local computations
● Data exchange and communication
● Idle time (load imbalance, spurious synchronization)
THE BASIC CONCEPT: SPEED-UP
● This quantity measures how much faster the code runs because we are using many processes.
● Define T(1): the optimal (reference) running time with a single process.
● Define T(p): the running time with p processes.
● The speed-up is then the ratio: S(p) = T(1) / T(p)
● We expect this number to go up as we keep increasing the number of processes.
AMDAHL'S LAW
● Although this is a crude model, it can provide a general sense of what speed-up can be achieved. This is a good measure to understand how much of the code one should try to parallelize.
● Assume that a fraction f of the code is executed sequentially. Then the speed-up is given by: S(p) = 1 / (f + (1 - f)/p)
● This means that the speed-up has an upper bound: S(p) ≤ 1/f. The efficiency must go to 0 eventually. In most cases, this statement is true as p increases.
● However, we are saved by another result: f often decreases with n; that is, the fraction of parallel work increases with n.
● This is why we can still benefit from Peta- and Exascale machines with 100,000+ cores.
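The saturation of the speed-up is easy to check numerically. A small sketch (the function name is ours):

```python
def amdahl_speedup(f, p):
    """Amdahl's law: speed-up when a fraction f of the work is
    sequential and the remaining 1-f is split over p processes."""
    return 1.0 / (f + (1.0 - f) / p)

# With 5% sequential work the speed-up saturates well below p:
for p in [10, 100, 1000]:
    print(p, round(amdahl_speedup(0.05, p), 2))
# The upper bound is 1/f = 20, no matter how large p gets.
```

Even with only 5% of sequential work, 1000 processes give less than a 20x speed-up.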
PROFILING AND CUDA COMPUTING
● A similar situation arises with CUDA.
1. Profile your code. Kernel 1: 90%.
2. Port kernel 1 to the GPU. Kernel 1: 10%. Kernel 2 becomes 80% now. Further optimizing kernel 1 will no longer yield any benefit.
● Remember to iterate between profiling and optimizing your code.
● Always focus on the longest kernel as given by your profiler.
A MORE APPROPRIATE CONCEPT: EFFICIENCY
● The previous definition has a problem. As p increases, we want to know whether the speed-up scales as p or not.
● This might be difficult to assess from a plot. Ideally, the speed-up is a straight line.
● It is therefore more convenient to look at the efficiency: E(p) = S(p) / p = T(1) / (p T(p))
● Ideally that quantity is simply a constant as p increases. That is easier to read from a plot.
● The maximum value for the efficiency is 1 (except in some rare circumstances because of cache effects).
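As a quick illustration, the efficiency from the definition above, applied to made-up strong-scaling timings (the numbers are ours, not measurements from the lecture):

```python
def efficiency(t1, tp, p):
    """Parallel efficiency E(p) = T(1) / (p * T(p))."""
    return t1 / (p * tp)

# Hypothetical timings from a strong-scaling run:
t1 = 16.0
for p, tp in [(1, 16.0), (4, 4.4), (16, 1.4)]:
    print(p, round(efficiency(t1, tp, p), 2))
# The drop below 1.0 is immediately visible, while on a raw
# speed-up plot the same data would still look "almost linear".
```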
EFFICIENCY PLOTS
(Figure: typical behavior of the efficiency as the number of processes increases or as the problem size, i.e., the amount of computational work to perform, increases.)
EXAMPLE 1: DOT PRODUCT
Two-step algorithm:
1. Calculate the local dot product on each process.
2. Use a spanning tree for the final reduction: log2 p passes are required.
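The two steps can be simulated in a few lines of plain Python (no MPI; the list `partial` stands in for the per-process values, and the names are ours):

```python
def tree_dot(x, y, p):
    """Two-step parallel dot product: local products, then a
    pairwise (spanning-tree) reduction in log2(p) passes."""
    n = len(x)
    # Step 1: each "process" r computes its local dot product.
    partial = [sum(x[i] * y[i] for i in range(r * n // p, (r + 1) * n // p))
               for r in range(p)]
    # Step 2: combine pairs at doubling strides; log2(p) passes.
    passes, stride = 0, 1
    while stride < p:
        for r in range(0, p, 2 * stride):
            partial[r] += partial[r + stride]
        stride *= 2
        passes += 1
    return partial[0], passes

x = [1.0] * 8
y = [2.0] * 8
print(tree_dot(x, y, 4))  # (16.0, 2): the dot product, in log2(4) = 2 passes
```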
EXAMPLE 1: DOT PRODUCT
Total run time with one process: T(1) = 2n tc, where tc is the time per arithmetic operation.
Total run time in parallel: T(p) = 2(n/p) tc + (ts + tw) log2 p
EFFICIENCY
• Efficiency: E(p) = T(1) / (p T(p)) = 1 / (1 + p (ts + tw) log2 p / (2n tc)), where p (ts + tw) log2 p is the parallel communication overhead and 2n tc is the sequential work.
• The efficiency can be maintained provided that we do not scale p faster than n / log2 p.
• This means that p cannot increase as fast as n.
• Iso-efficiency means that p increases at a rate such that the efficiency remains constant.
• This is very important. Good algorithms are such that p can be increased rapidly at iso-efficiency.
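The iso-efficiency statement can be checked numerically: if n grows like p log2 p, the efficiency of the model stays flat. A sketch with made-up machine constants (tc, ts, tw are ours, chosen only for illustration):

```python
import math

# Efficiency of the parallel dot product under the model
# T(p) = 2(n/p) tc + (ts + tw) log2 p.
tc, ts, tw = 1e-9, 1e-6, 1e-8   # made-up machine constants

def dot_efficiency(n, p):
    t1 = 2 * n * tc
    tp = 2 * (n / p) * tc + (ts + tw) * math.log2(p)
    return t1 / (p * tp)

# Scaling n proportionally to p * log2(p) keeps the efficiency flat:
for p in [2, 8, 32, 128]:
    n = int(1e6 * p * math.log2(p))
    print(p, round(dot_efficiency(n, p), 3))
```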
EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 1D PARTITIONING
• In that case, we can model the serial running time as: T(1) = 2n² tc
• Parallel running time (comp + comm): T(p) = 2(n²/p) tc + ts log2 p + tw n
• Efficiency: E(p) = 1 / (1 + (p ts log2 p + p tw n) / (2n² tc))
• Iso-efficiency requires: n ∝ p, i.e., p = O(n)
• This is actually not very good. We expect that p should be able to increase like n², roughly, because this is the amount of work to do.
EXAMPLE 2: MATRIX-VECTOR PRODUCT WITH 2D PARTITIONING
• Recall that we previously said that 2D partitioning is better. Let's see if theory supports our claim.
• Computation: 2(n²/p) tc
• Send b to the diagonal processes: ts + tw n/√p
• Broadcast b in each column: (ts + tw n/√p) log2 √p
• Reduction across columns (same running time as the broadcast because these operations are dual of one another): (ts + tw n/√p) log2 √p
ISO-EFFICIENCY
• With the previous results: T(p) = 2(n²/p) tc + ts log2 p + tw (n/√p) log2 p, up to lower-order terms
• Iso-efficiency: √p log2 p ∝ n, i.e., p = O(n² / (log2 n)²)
• This is much better than with the 1D partitioning.
• We can increase p more rapidly at iso-efficiency.
• Another interpretation is that, for a given number of processes, this scheme is faster. Practically, it has less communication.
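Plugging both models into the efficiency formula makes the comparison concrete. A sketch, again with made-up constants tc, ts, tw:

```python
import math

# Efficiency of the 1D and 2D matrix-vector models from the slides
# (the machine constants are made up for illustration):
tc, ts, tw = 1e-9, 1e-6, 1e-8

def eff_1d(n, p):
    t1 = 2 * n * n * tc
    tp = 2 * n * n / p * tc + ts * math.log2(p) + tw * n
    return t1 / (p * tp)

def eff_2d(n, p):
    t1 = 2 * n * n * tc
    tp = (2 * n * n / p * tc + ts * math.log2(p)
          + tw * (n / math.sqrt(p)) * math.log2(p))
    return t1 / (p * tp)

n, p = 10_000, 1024
print(round(eff_1d(n, p), 3), round(eff_2d(n, p), 3))
# The 2D scheme wins: its communication term carries n/sqrt(p)
# instead of n.
```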
ISO-EFFICIENCY PLOTS
● We plot for the two algorithms the value of p as a function of n such that iso-efficiency is maintained.
● Larger values of p are better.
● This means improved scalability.
(Plot annotations: larger p at iso-efficiency = algorithm is more scalable; red = higher efficiency than blue = runs faster.)
SUMMARY OF COMMUNICATION TIMES

Operation | Hypercube time
One-to-all broadcast, All-to-one reduction | min((ts + tw m) log p, 2(ts log p + tw m))
All-to-all broadcast, All-to-all reduction | ts log p + tw m (p-1)
All-reduce | min((ts + tw m) log p, 2(ts log p + tw m))
Scatter, Gather | ts log p + tw m (p-1)
All-to-all personalized | (ts + tw m)(p-1)
Circular shift | ts + tw m

● m: size of the message
● p: number of processes
● ts: latency (start-up time per message)
● tw: per-word transfer time (inverse of the bandwidth)
MATRIX-MATRIX PRODUCTS

Algorithm in pseudo-code:

for i = 0:n-1 do
  for j = 0:n-1 do
    C(i,j) = 0;
    for k = 0:n-1 do
      C(i,j) += A(i,k) * B(k,j);
    end
  end
end
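The pseudo-code above, written out in Python for a small test case:

```python
def matmul(A, B, n):
    """Triple-loop matrix-matrix product, exactly as in the pseudo-code."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B, 2))  # [[19.0, 22.0], [43.0, 50.0]]
```

The serial cost is the 2n³ operations of the three nested loops; the parallel algorithms that follow distribute exactly this work in blocks.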
NAÏVE BLOCK OPERATIONS
• The algorithm proceeds by doing block operations.
• For p processes, we create p blocks of size n/√p × n/√p.
• Simple approach:
  • all-to-all broadcast in each row of A
  • all-to-all broadcast in each column of B
  • perform the calculation with local data on each process
• Iso-efficiency: p = O(n²)
(Figure: block layouts of Matrix A and Matrix B.)
CANNON'S ALGORITHM
● There are two issues with this simple algorithm:
  ▪ We should be able to increase p closer to n³ at iso-efficiency.
  ▪ This algorithm requires a lot of memory since a process needs to store an entire block row of A and block column of B.
● Cannon's algorithm allows reducing the memory footprint.
● It works by cleverly shuffling the blocks of A and B such that each process never stores more than one block of A and B.
● The blocks of A are rotated inside each row while the blocks of B are rotated inside each column.
● The trick is to start with the right alignment.
● We will see later that the Dekel-Nassimi-Sahni (DNS) algorithm allows improving the scalability significantly.
COMMUNICATION STEPS
(Figure: a 4×4 process grid. Left: initial layout, where process (i,j) holds blocks A(i,j) and B(i,j). Right: after the initial alignment, process (i,j) holds A(i, (i+j) mod 4) and B((i+j) mod 4, j).)
● After the first communication step, each process has data to perform its first block multiplication.
● A key point is that, from then on, only a simple communication with neighbors is required at each step.
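The alignment and the rotations can be simulated without MPI by moving entries of two q×q arrays around; each array cell stands in for the single block a process holds (a sketch, names ours):

```python
def cannon(A, B, q):
    """Cannon's algorithm on a q x q 'process grid', one block of A
    and one block of B per process, simulated in plain Python."""
    # Initial alignment: shift row i of A left by i, column j of B up by j,
    # so process (i,j) holds A(i,(i+j)%q) and B((i+j)%q,j).
    a = [[A[i][(i + j) % q] for j in range(q)] for i in range(q)]
    b = [[B[(i + j) % q][j] for j in range(q)] for i in range(q)]
    C = [[0] * q for _ in range(q)]
    for _ in range(q):
        # Local block multiplication on every process.
        for i in range(q):
            for j in range(q):
                C[i][j] += a[i][j] * b[i][j]
        # Neighbor communication only: rotate A left in each row,
        # B up in each column.
        a = [[a[i][(j + 1) % q] for j in range(q)] for i in range(q)]
        b = [[b[(i + 1) % q][j] for j in range(q)] for i in range(q)]
    return C

# With 1x1 "blocks" this is just a 3x3 matrix product:
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(cannon(A, B, 3))
```

After q rotations every process has seen all q block pairs it needs, and C = A·B, while no process ever stored more than one block of A and one of B.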
FIRST SHIFT
● B blocks are shifted up while A blocks are shifted left.
● Each process now has the next two blocks required for the product.
● Similar second and third shifts are required to complete the calculation: shift B up and A left.
(Figure: left, the aligned layout, where process (i,j) holds A(i, (i+j) mod 4) and B((i+j) mod 4, j); right, after the first shift, process (i,j) holds A(i, (i+j+1) mod 4) and B((i+j+1) mod 4, j).)
SCALABILITY: INCREASING THE NUMBER OF PROCESSES

Dimension of matrix: 1536, block size: 1536, number of procs along both dims: 1 1
The calculation took 16.068481 seconds; p x runtime = 16.068481
Dimension of matrix: 1536, block size: 768, number of procs along both dims: 2 2
The calculation took 4.514676 seconds; p x runtime = 18.058703
Dimension of matrix: 1536, block size: 512, number of procs along both dims: 3 3
The calculation took 2.262061 seconds; p x runtime = 20.358550
Dimension of matrix: 1536, block size: 384, number of procs along both dims: 4 4
The calculation took 2.193327 seconds; p x runtime = 35.093235
Dimension of matrix: 1536, block size: 256, number of procs along both dims: 6 6
The calculation took 0.472822 seconds; p x runtime = 17.021599
Dimension of matrix: 1536, block size: 192, number of procs along both dims: 8 8
The calculation took 0.174466 seconds; p x runtime = 11.165817
SCALABILITY: FIXED NUMBER OF PROCESSES
Dimension of matrix: 1536, block size: 512
Number of procs along both dims: 3 3
1 node The calculation took 2.090732 seconds
p x runtime = 18.816589
2 nodes The calculation took 1.783496 seconds
p x runtime = 16.051465
3 nodes The calculation took 1.705040 seconds
p x runtime = 15.345360
5 nodes The calculation took 1.671645 seconds
p x runtime = 15.044806
9 nodes The calculation took 1.651060 seconds
p x runtime = 14.859539
CANNON'S ALGORITHM
• MPI code: mmm/
• With this algorithm, processes store only 2 blocks at a time.
• The cost of communication is slightly different from the naïve algorithm, but in the end the running times are comparable.
• The iso-efficiency curve is very close to the other algorithm's, with p scaling as n².
• Cannon's main feature is the reduced memory footprint. This is important as MPI codes often need a lot of memory.
DEKEL-NASSIMI-SAHNI ALGORITHM
● The number of operations is O(n³). Therefore we should be able to use up to about p = O(n³) processes. The DNS algorithm achieves this.
● In this algorithm, each process P(i,j,k) calculates one product A(i,k) * B(k,j), where these are matrix blocks.
INITIAL DISTRIBUTION OF DATA
● We use p = n³ processes.
● We want to compute C(i,j) += A(i,k) * B(k,j).
● Process (i,j,k) will do this multiplication.
● Then, a reduction over k is used to get the final C(i,j).
● Initially, the processes (i,j,0) hold the data for A and B.
STEP 1
● Let's focus on A. B is similar.
● We need to propagate the blocks of A such that process (i,j,k) has A(i,k).
● So process (i,j,0) (which has A(i,j)) needs to send its block to all processes (i,*,j).
● The first communication is a Send from (i,j,0) to (i,j,j).
● Do this for all i and j.
STEP 2
● We now do a broadcast inside each column.
● (i,j,j) broadcasts to all (i,*,j).
● A broadcast with a communicator is used for this.
STEP 3
● At last, we can compute something!
● Process (i,j,k) has A(i,k) and B(k,j).
STEP 4
● We finally do a reduction inside each vertical column (index k).
● Process (i,j,0) has C(i,j) in the end.
(Figure: reduction along k.)
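The four steps can be traced in plain Python by giving every "process" (i,j,k) its own dictionary entry (a sketch without MPI; names ours):

```python
def dns(A, B, n):
    """DNS data movement with p = n^3 simulated processes."""
    # Steps 1 + 2: after the send to (i,j,j) and the column broadcast,
    # process (i,j,k) holds A(i,k) and B(k,j).
    a = {(i, j, k): A[i][k] for i in range(n) for j in range(n) for k in range(n)}
    b = {(i, j, k): B[k][j] for i in range(n) for j in range(n) for k in range(n)}
    # Step 3: each process computes its single product A(i,k) * B(k,j).
    prod = {(i, j, k): a[i, j, k] * b[i, j, k]
            for i in range(n) for j in range(n) for k in range(n)}
    # Step 4: reduction along k; process (i,j,0) ends up with C(i,j).
    return [[sum(prod[i, j, k] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(dns(A, B, 2))  # [[19, 22], [43, 50]]
```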
WHY IS DNS A BETTER ALGORITHM?
• The iso-efficiency is: p = O(n³ / (log n)³)
• In Cannon, p = O(n²), whereas in DNS, p = O(n³/(log n)³).
• What does it mean that the iso-efficiency curve of DNS is better than Cannon's?
• Because DNS is able to use larger blocks, fewer communications are required.
• The algorithm is therefore more efficient (has better scalability).
• For a block of size s, communication scales like s² while computation scales like s³. Large blocks are favorable.