Introduction to Communication-Avoiding Algorithms · 1/25/16
TRANSCRIPT
1/25/16
1
Introduction to Communication-Avoiding Algorithms
www.cs.berkeley.edu/~demmel
Jim Demmel, and many, many others…
Why avoid communication? (1/2)
• Algorithms have two costs (measured in time or energy):
  1. Arithmetic (FLOPs)
  2. Communication: moving data between
     – levels of a memory hierarchy (sequential case)
     – processors over a network (parallel case)
[Figure: sequential case, CPU ↔ cache ↔ DRAM; parallel case, CPUs with local DRAM connected over a network]
Why avoid communication? (2/3)
• Running time of an algorithm is the sum of 3 terms:
  – #flops × time_per_flop
  – #words moved / bandwidth   } communication
  – #messages × latency        }
• time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]:

  Annual improvements:
    time_per_flop: 59%
    Network: bandwidth 26%, latency 15%
    DRAM: bandwidth 23%, latency 5%

• Avoid communication to save time
• Goal: reorganize algorithms to avoid communication
  • Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
  • Very large speedups possible
  • Energy savings too!
Why Minimize Communication? (2/2)
[Figure: energy cost per operation in picojoules (log scale, 1–10000), now vs. projected 2018. Source: John Shalf, LBL]
• Minimize communication to save energy
Alternative Cost Model for Algorithms?
• Total distance moved by beads on an abacus
Goals
• Redesign algorithms to avoid communication
  • Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Attain lower bounds if possible
• Current algorithms often far from lower bounds
• Large speedups and energy savings possible
• Lots of open problems / potential class projects
Sample Speedups
• Up to 12x faster for 2.5D matmul on 64K core IBM BG/P
• Up to 3x faster for tensor contractions on 2K core Cray XE6
• Up to 6.2x faster for All-Pairs-Shortest-Path on 24K core Cray XE6
• Up to 2.1x faster for 2.5D LU on 64K core IBM BG/P
• Up to 11.8x faster for direct N-body on 32K core IBM BG/P
• Up to 13x faster for Tall Skinny QR on Tesla C2050 Fermi NVIDIA GPU
• Up to 6.7x faster for symeig(band A) on 10 core Intel Westmere
• Up to 2x faster for 2.5D Strassen on 38K core Cray XT4
• Up to 4.2x faster for MiniGMG benchmark bottom solver, using CA-BiCGStab (2.5x for overall solve)
  – 2.5x / 1.5x for combustion simulation code
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:

“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.” FY 2010 Congressional Budget, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.

CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); “Tall-Skinny” QR (Grigori, Hoemmen, Langou, JD)
Summary of CA Algorithms
• “Direct” Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax=λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: Sca/LAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
Outline
• “Direct” Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
• Related work
Lower bound for all “direct” linear algebra
• Let M = “fast” memory size (per processor). Then:

  #words_moved (per processor) = Ω(#flops (per processor) / M^{1/2})
  #messages_sent (per processor) = Ω(#flops (per processor) / M^{3/2})

  (The second bound follows from the first: #messages_sent ≥ #words_moved / largest_message_size, and the largest message fits in fast memory, so its size is at most M.)
• Parallel case: assume either load or memory balanced
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
SIAM SIAG/Linear Algebra Prize, 2012: Ballard, Demmel, Holtz, Schwartz
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress / possible projects
• Case study: Matrix Multiply
Naïve Matrix Multiply {implements C = C + A*B}
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Figure: C(i,j) updated by the product of row A(i,:) and column B(:,j)]
Naïve Matrix Multiply {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}
Naïve Matrix Multiply {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}          … n^2 reads altogether
    for j = 1 to n
      {read C(i,j) into fast memory}            … n^2 reads altogether
      {read column j of B into fast memory}     … n^3 reads altogether
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}        … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.
  for i = 1 to n/b
    for j = 1 to n/b
      {read block C(i,j) into fast memory}
      for k = 1 to n/b
        {read block A(i,k) into fast memory}
        {read block B(k,j) into fast memory}
        C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.
  for i = 1 to n/b
    for j = 1 to n/b
      {read block C(i,j) into fast memory}      … b^2 × (n/b)^2 = n^2 reads
      for k = 1 to n/b
        {read block A(i,k) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
        {read block B(k,j) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
        C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
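The tiled loop nest above can be sketched in a few lines of Python. This is an illustrative numpy stand-in (the function name and signature are ours, not from any library); the three b-by-b blocks touched inside the k loop play the role of the blocks held in fast memory.

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    """Blocked (tiled) C = C + A*B; n is assumed to be a multiple of b."""
    n = A.shape[0]
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]              # {read block C(i,j)} - a view into C
            for k in range(0, n, b):
                # {read blocks A(i,k), B(k,j)}; do a b x b matmul on blocks
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C                                   # C(i,j) written back via the views
```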
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 2n^3/(M/3)^{1/2} + 2n^2
• Attains lower bound = Ω(#flops / M^{1/2})
• But what if we don’t know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = A·B, viewed as 2x2 block matrices:

  [C11 C12]   [A11 A12]   [B11 B12]   [A11·B11 + A12·B21   A11·B12 + A12·B22]
  [C21 C22] = [A21 A22] · [B21 B22] = [A21·B11 + A22·B21   A21·B12 + A22·B22]

• True when each Aij etc. is 1x1 or n/2 x n/2

  func C = RMM(A, B, n)
    if n = 1, C = A * B, else {
      C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
      C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
      C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
      C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2) }
    return
Recursive Matrix Multiplication (RMM) (2/2)

  A(n) = # arithmetic operations in RMM(·,·,n)
       = 8·A(n/2) + 4(n/2)^2 if n > 1, else 1
       = 2n^3 … same operations as usual, in different order
  W(n) = # words moved between fast, slow memory by RMM(·,·,n)
       = 8·W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
       = O(n^3 / M^{1/2} + n^2) … same as blocked matmul

“Cache oblivious”: works for memory hierarchies, but not a panacea.

For big speedups, see SC12 poster on “Beating MKL and ScaLAPACK at Rectangular Matmul”, 11/13 at 5:15-7pm
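RMM translates directly into Python; this is an illustrative sketch (our own function name), notable for containing no block size at all, which is what makes it cache-oblivious:

```python
import numpy as np

def rmm(A, B):
    """Cache-oblivious recursive matmul (RMM); n assumed a power of 2.
    No tuning parameter b appears: the recursion fits each level of the
    memory hierarchy automatically once subproblems are small enough."""
    n = A.shape[0]
    if n == 1:
        return A * B                            # 1x1 base case
    h = n // 2
    C = np.empty((n, n))
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C
```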
CARMA Performance: Shared Memory
[Figure: square case (m = k = n), performance (linear scale) vs. problem size (log scale): MKL vs. CARMA, single and double precision, with single- and double-precision peak lines.]
Intel Emerald: 4 Intel Xeon X7560 × 8 cores, 4 × NUMA
CARMA Performance: Shared Memory
[Figure: inner-product-shaped case (m = n = 64), performance (linear scale) vs. problem size (log scale): MKL vs. CARMA, single and double precision.]
Intel Emerald: 4 Intel Xeon X7560 × 8 cores, 4 × NUMA
Why is CARMA Faster? L3 Cache Misses
[Figure: shared memory inner product (m = n = 64; k = 524,288): CARMA has 97% fewer misses and 86% fewer misses than MKL in the two precisions.]
Parallel MatMul with 2D Processor Layout
• P processors in a P^{1/2} x P^{1/2} grid
  – Processors communicate along rows, columns
• Each processor owns an n/P^{1/2} x n/P^{1/2} submatrix of A, B, C
• Example: P = 16, processors numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij
[Figure: three 4x4 grids of processors P00 … P33, one each for C, A, and B, illustrating C = A * B]
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω(#flops / M^{1/2}) = Ω((n^3/P) / (n^2/P)^{1/2}) = Ω(n^2 / P^{1/2})
    • #messages = Ω(#flops / M^{3/2}) = Ω((n^3/P) / (n^2/P)^{3/2}) = Ω(P^{1/2})
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn{96,100}.ps
• Comparison to Cannon’s Algorithm
  – Cannon attains lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid
• C(i,j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
• A(i,k) is an n/P^{1/2} x b submatrix of A
• B(k,j) is a b x n/P^{1/2} submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be a square processor grid
[Figure: block-column panel A(i,k) and block-row panel B(k,j) updating C(i,j)]
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid

  for k = 0 to n/b - 1   … where b is the block size = # cols in A(i,k) and # rows in B(k,j)
    for all i = 1 to P^{1/2}   … in parallel
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^{1/2}   … in parallel
      owner of B(k,j) broadcasts it to whole processor column (using binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^{1/2}
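The communication pattern above can be emulated serially, which makes the algebra easy to check. This is a sketch under stated assumptions (our own function `summa`; a square p x p grid with n divisible by p and b; the "broadcasts" become shared slices):

```python
import numpy as np

def summa(A, B, p, b):
    """Serial emulation of SUMMA on a p x p processor grid with panel width b.
    Each "processor" (i,j) owns an (n/p) x (n/p) block of C; each outer step
    broadcasts a column panel of A along processor rows and a row panel of B
    along processor columns, then every processor does a local rank-b update."""
    n = A.shape[0]
    s = n // p                                  # local block dimension
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]                      # panel broadcast along proc rows
        Brow = B[k:k+b, :]                      # panel broadcast along proc cols
        for i in range(p):                      # every processor updates its own block
            for j in range(p):
                C[i*s:(i+1)*s, j*s:(j+1)*s] += (
                    Acol[i*s:(i+1)*s, :] @ Brow[:, j*s:(j+1)*s])
    return C
```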
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^{1/2}) = Ω(n^2 / P^{1/2})
  #messages    = Ω((n^3/P) / M^{3/2}) = Ω(P^{1/2})
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, sym. and nonsym. eigenproblems, SVD
  • Needed to replace partial pivoting in LU
  • Need randomization for nonsym. eigenproblem (so far)

Can we do better?
Can we do better?
• Aren’t we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bounds still true if more memory
  – Can we attain them?
  – Special case: “3D Matmul”: uses M = O(n^2/P^{2/3})
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    – M = O(n^2/P^{2/3}) is P^{1/3} times the minimum
• Not always that much memory available…
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid
[Figure: (P/c)^{1/2} x (P/c)^{1/2} x c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^{1/2} x n(c/P)^{1/2}
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis so that P(i,j,0) owns C(i,j)
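Steps (1)-(3) can be emulated serially to see the arithmetic: split the inner k-sum into c slices (one per processor layer), form each layer's partial product from its replicated copy of A and B, then sum-reduce. A minimal sketch (the function name `matmul_25d` is ours; n assumed divisible by c):

```python
import numpy as np

def matmul_25d(A, B, c):
    """Serial emulation of the 2.5D schedule: layer l of the processor grid
    computes the partial sum over its 1/c-th of the inner dimension, and a
    final sum-reduce along the grid's k-axis combines the c partial products."""
    n = A.shape[0]
    w = n // c
    partials = [A[:, l*w:(l+1)*w] @ B[l*w:(l+1)*w, :]   # layer l's share of sum_m
                for l in range(c)]
    return sum(partials)                                # sum-reduce along k-axis
```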
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: percentage of machine peak for matrix multiplication on 16,384 nodes of BG/P, n = 8192 and n = 131072, 2D MM vs. 2.5D MM using c = 16 matrix copies; 2.5D is up to 12X and 2.7X faster.]
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: execution time normalized by 2D, broken into computation / idle / communication, for n = 8192 and n = 131072, 2D vs. 2.5D with c = 16 copies: up to 95% reduction in communication; 12x and 2.7x faster overall.]
Distinguished Paper Award, EuroPar’11; SC’11 paper by Solomonik, Bhatele, Demmel.
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [γT + βT/M^{1/2} + αT/(m·M^{1/2})] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [γE + βE/M^{1/2} + αE/(m·M^{1/2})] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^{1/3} (3D algorithm), if starting with 1 copy of inputs
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• Can prove lower bounds on network (e.g. 3D torus for matmul)
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
• See Andrew Gearhart’s PhD thesis
Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^{1/2} + Fi·αi/Mi^{3/2} = Fi·[γi + βi/Mi^{1/2} + αi/Mi^{3/2}] = Fi·ξi
  – Choose Fi so that Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
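The closed-form answer above (Fi proportional to 1/ξi, so every processor finishes at the same time T) is two lines of code. An illustrative sketch (function name ours):

```python
def optimal_split(xi, total_flops):
    """Work assignment minimizing T = max_i F_i * xi_i subject to sum F_i = total:
    make F_i proportional to 1/xi_i, so all processors finish simultaneously."""
    inv = [1.0 / x for x in xi]
    s = sum(inv)
    F = [total_flops * v / s for v in inv]   # F_i = total * (1/xi_i) / sum_j (1/xi_j)
    T = total_flops / s                      # common finish time
    return F, T
```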
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n) * B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d!-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews
[Figure: C(i,j,k) = Σm A(i,j,m) * B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Application to Tensor Contractions (continued)
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
TSQR: QR of a Tall, Skinny matrix

  W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
                       = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

  [R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

  [R01; R11] = Q02·R02
Output of TSQR: {Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02}
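The factorization tree above is easy to sketch with numpy. This is a two-level illustration (names `tsqr`, `nblocks` are ours) that keeps only the R factors and returns the final R; in the real algorithm the implicit Q tree {Q00, …, Q02} is retained too:

```python
import numpy as np

def tsqr(W, nblocks=4):
    """Two-level TSQR sketch: local QR of each row block (done in parallel in
    the real algorithm), then QR of the stacked local R factors. Returns the
    final R; the Q factors form the reduction tree and are discarded here."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wb)[1] for Wb in blocks]   # leaf QRs: W_i = Q_i0 * R_i0
    R = np.vstack(Rs)
    while R.shape[0] > W.shape[1]:                # reduce stacked R's up the tree
        R = np.linalg.qr(R)[1]
    return R
```

The check below uses the identity W^T W = R^T R, which holds for any QR factorization of W regardless of sign conventions in R.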
TSQR: An Architecture-Dependent Algorithm

  W = [W0; W1; W2; W3]

  Parallel (binary tree):  (R00, R10, R20, R30) → (R01, R11) → R02
  Sequential (flat tree):  R00 → R01 → R02 → R03
  Dual core: a hybrid of the two trees

Can choose reduction tree dynamically. Multicore / multisocket / multirack / multisite / out-of-core: ?
TSQR Performance Results
• Parallel speedups
  – Up to 8x on 8-core Intel Clovertown
  – Up to 6.7x on 16-processor Pentium cluster
  – Up to 4x on 32-processor IBM BlueGene
  – Up to 13x on NVidia GPU
  – Up to 4x on 4 cities vs 1 city (Dongarra, Langou et al)
  – Only 1.6x slower on Cloud than just accessing data twice (Gleich and Benson)
• Sequential speedup
  – “Infinite” for out-of-core on PowerPC laptop
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using similar idea for TSLU as TSQR: use reduction tree to do “Tournament Pivoting”

  W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
    Choose b pivot rows of W1, call them W1’; ditto for W2’, W3’, W4’
  [W1’; W2’] = P12·L12·U12;  [W3’; W4’] = P34·L34·U34
    Choose b pivot rows of each, call them W12’ and W34’
  [W12’; W34’] = P1234·L1234·U1234
    Choose b pivot rows
• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as partial pivoting on a larger matrix
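A minimal executable sketch of tournament pivoting (function names ours; a flat second round stands in for the binary tree, which is enough to show the candidate-then-final structure):

```python
import numpy as np

def gepp_pivots(W, b):
    """Original indices of the b pivot rows chosen by LU with partial pivoting."""
    A = W.astype(float).copy()
    rows = list(range(A.shape[0]))
    piv = []
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))   # largest entry in column k
        A[[k, p]] = A[[p, k]]                      # swap pivot row to the top
        rows[k], rows[p] = rows[p], rows[k]
        piv.append(rows[k])
        A[k+1:, k] /= A[k, k]                      # eliminate below the pivot
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return piv

def tournament_pivots(W, b, nblocks=4):
    """Each block nominates b candidate pivot rows via local partial pivoting;
    one more round on the stacked candidates picks the final b pivot rows."""
    idx_blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = []
    for idx in idx_blocks:
        candidates.extend(idx[gepp_pivots(W[idx], b)])   # W_i -> W_i'
    candidates = np.array(candidates)
    final = gepp_pivots(W[candidates], b)                # playoff round
    return candidates[final].tolist()
```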
LU Speedups from Tournament Pivoting and 2.5D
[Figure: percentage of machine peak for 2.5D LU with CA-pivoting on BG/P (n = 65,536), vs. 2D LU (CA-pvt) and ScaLAPACK PDGETRF, on 256 / 512 / 1024 / 2048 nodes.]
2.5D vs 2D LU With and Without Pivoting
[Figure: time (sec) for LU on 16,384 nodes of BG/P (n = 131,072), broken into compute / idle / communication, for NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, CA-pivot 2.5D; 2.5D is 2X faster both with and without pivoting.]
Thm: Perfect strong scaling impossible, because latency × bandwidth = Ω(n^2)
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A*P = Q*R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach like partial pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach “reveals the rank” of A in the sense that the leading r x r submatrix of R has singular values “near” the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting
Communication Lower Bounds for Strassen-like matmul algorithms

  Classical O(n^3) matmul:        #words_moved = Ω(M·(n/M^{1/2})^3 / P)
  Strassen’s O(n^{lg 7}) matmul:  #words_moved = Ω(M·(n/M^{1/2})^{lg 7} / P)
  Strassen-like O(n^ω) matmul:    #words_moved = Ω(M·(n/M^{1/2})^ω / P)

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be “regular” and connected
• Extends up to M = n^2 / p^{2/ω}
• Best Paper Prize (SPAA’11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Communication Avoiding Parallel Strassen (CAPS)

  CAPS:
    if enough memory and P ≥ 7
      then BFS step   … run all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
      else DFS step   … run all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

In practice, how best to interleave BFS and DFS steps is a “tuning parameter”
Performance Benchmarking, Strong Scaling Plot
[Figure: Franklin (Cray XT4), n = 94080.]
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
What about fast algorithms for the rest of linear algebra?
• “Fast Matrix Multiplication is Stable” – JD, I. Dumitriu, O. Holtz, R. Kleinberg (2007)
• “Fast Linear Algebra is Stable” – JD, I. Dumitriu, O. Holtz (2007)
• “Sequential Communication Bounds for Fast Linear Algebra” – G. Ballard, JD, O. Holtz, O. Schwartz (EECS-2012-36)
• Parallel communication bounds?
• Implementations?
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A ⇒ Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense ⇒ Banded ⇒ Tridiagonal
• Implemented in LAPACK’s sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
Conventional vs CA-SBR
• Conventional: touches all data 4 times
• Communication-avoiding: touches all data once
• Many tuning parameters: number of “sweeps”, #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
• Right choices reduce #words_moved by a factor of M/bw, not just M^{1/2}
Speedups of Sym. Band Reduction vs LAPACK’s DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x
What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n, for i = 1:n, for j = 1:n
      D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can’t reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene’s Algorithm:
    D = DC-APSP(A, n)
      D = A; partition D = [[D11, D12]; [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12;  D21 = D21 ⊗ D11;  D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21;  D12 = D12 ⊗ D22;  D11 = D12 ⊗ D21
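The point that APSP is "matmul over a different semiring" can be made executable. A minimal dense sketch (function names ours): `floyd_warshall` is the classical triple loop, and `minplus` is the D = A⊗B operator, i.e. matmul with (+, ×) replaced by (min, +) and the result min'd into the existing D:

```python
import numpy as np

INF = float("inf")

def minplus(D, A, B):
    """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the semiring operator A (x) B."""
    n = D.shape[0]
    for i in range(n):
        for j in range(n):
            D[i, j] = min(D[i, j], min(A[i, k] + B[k, j] for k in range(n)))
    return D

def floyd_warshall(A):
    """Classical APSP: same triple loop as matmul, different semiring."""
    D = A.astype(float).copy()
    n = D.shape[0]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i, j] = min(D[i, j], D[i, k] + D[k, j])
    return D
```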
Performance of 2.5D APSP using Kleene
[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores): GFlops vs. number of compute nodes (1–1024) for n = 4096 and n = 8192, with replication c = 1, 4, 16; up to 6.2x and 2x speedups.]
What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan, ’79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #words_moved = Ω(w^3/M^{1/2}) etc.
• Thm (George, ’73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A*B, both diagonal: no communication in parallel case
• Ex: A*B, both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^{1/2}, iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #words_moved = Ω(min(d·n/P^{1/2}, d^2·n/P))
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: #words_moved = Ω(d^2·n/(P·M^{1/2}))
• Attained by divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost
• Recent result (P. Koanantakool et al, IPDPS’16): Dense*Sparse
Summary of Direct Linear Algebra (1/2)
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work / possible projects on
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, low rank factorizations, eigenproblems…
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into Sca/LAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry / DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress
• Integration into big data analysis system based on Spark at AMPLab
Summary of Direct Linear Algebra (2/2)
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work / possible projects on
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, low rank factorizations, eigenproblems…
    • Compare fast QR (…
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into Sca/LAPACK, PLASMA, MAGMA, …
Outline
• “Direct” Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
• Related work
Recall optimal sequential Matmul
• Naïve code: for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• “Blocked” code: for i=1:n/b, for j=1:n/b, for k=1:n/b, C[i,j] += A[i,k]*B[k,j]   … b x b matmul
• Thm: Picking b = M^{1/2} attains lower bound: #words_moved = Ω(n^3/M^{1/2})
• Where does 1/2 come from?
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
  Δ =     [ 1  0  1 ]   A
          [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: #words_moved = Ω(n^3/M^{e-1}) = Ω(n^3/M^{1/2})
  Attained by block sizes M^{xi}, M^{xj}, M^{xk} = M^{1/2}, M^{1/2}, M^{1/2}
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

            i  j
  Δ =     [ 1  0 ]   F
          [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: #words_moved = Ω(n^2/M^{e-1}) = Ω(n^2/M^1)
  Attained by block sizes M^{xi}, M^{xj} = M^1, M^1
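The M x M blocking the theorem prescribes is just a tiling of the (i, j) loop nest so each tile reuses one tile of particles from fast memory. An illustrative sketch (function name ours; the pairwise "force" is a toy displacement sum, not a physical model, so the result is easy to check in closed form):

```python
import numpy as np

def blocked_forces(P, b):
    """Blocked direct n-body over 1D positions P: process b x b tiles of (i, j)
    pairs, so the b target and b source particles of each tile stand in for the
    M^1 x M^1 blocks from the LP. Toy force: F(i) += P(j) - P(i)."""
    n = len(P)
    F = np.zeros_like(P)
    for i0 in range(0, n, b):
        Pi = P[i0:i0+b]                              # tile of targets in "fast memory"
        for j0 in range(0, n, b):
            Pj = P[j0:j0+b]                          # tile of sources
            # accumulate all b*b pairwise interactions within the tile
            F[i0:i0+b] += (Pj[None, :] - Pi[:, None]).sum(axis=1)
    return F
```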
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
[Figure: 11.8x speedup.]
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

Some Applications
• Gravity, turbulence, molecular dynamics, plasma simulation, …
• Electron-beam lithography device simulation
• Hair…
  – www.fxguide.com/featured/brave-new-hair/
  – graphics.pixar.com/library/CurlyHairA/paper.pdf
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
  Δ =     [  1  0  1  0  0  1 ]   A1
          [  1  1  0  1  0  0 ]   A2
          [  0  1  1  0  1  0 ]   A3
          [  0  0  1  1  0  1 ]   A3, A4
          [  0  1  0  0  0  1 ]   A5
          [  1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1,…,x6]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: #words_moved = Ω(n^6/M^{e-1}) = Ω(n^6/M^{8/7})
  Attained by block sizes M^{2/7}, M^{3/7}, M^{1/7}, M^{2/7}, M^{3/7}, M^{4/7}
Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3:
    access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1, i2, …, ik) in S = subset of Z^k:
    access locations indexed by “projections”, e.g.
      φ_C(i1, i2, …, ik) = (i1+2*i3-i7)
      φ_A(i1, i2, …, ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Goal: communication lower bounds and optimal algorithms for any program that looks like this
• Can we bound #loop_iterations (#points in S) given bounds on #points in its images φ_C(S), φ_A(S), … ?
General Communication Bound
• Thm: Given a program with array refs given by projections φ_j, there is an e ≥ 1 such that
    #words_moved = Ω(#iterations / M^{e-1})
  where e is the value of a linear program:
    minimize e = Σ_j e_j
    subject to rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ / Tao / Carbery / Bennett:
  – Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
  – Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
      |S| ≤ Π_j |φ_j(S)|^{s_j}
Is this bound attainable? (1/2)
• But first: can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert’s 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest:
    • When at most 3 arrays
    • When at most 4 loop indices
    • When subscripts are subsets of indices
  – Also easy to get upper/lower bounds on e
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^{x1}, M^{x2}, …
• Ex: linear algebra, n-body, “random code,” join, …
• Conjecture: always attainable (modulo dependencies): work in progress / class project
• Long term goal: incorporate in compilers
Ongoing Work
• Automate generation of lower bounds
• Extend “perfect scaling” results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound (dependencies permitting) – can we prove this?
• Incorporate into compilers
Outline
• “Direct” Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
• Related work
Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such “Krylov Subspace Methods”
• Goal: minimize communication
  – Assume matrix “well-partitioned”
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Replace k iterations of y = A⋅x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any “well-partitioned” A
[Figure: dependency diagram showing which of the entries 1, 2, 3, 4, … 32 of x, A·x, A^2·x, A^3·x each value depends on]
The Matrix Powers Kernel: Sequential Algorithm
• Replace k iterations of y = A⋅x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
[Figure: the same dependency diagram computed in four steps (Step 1 … Step 4); each step reads one block of A and x once and computes a trapezoid of entries of Ax, A^2x, A^3x]
1234… …32
x
A·x
A2·x
A3·x
Communica/onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera/onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen/alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1 Step2 Step3
1234… …32
x
A·x
A2·x
A3·x
Communica/onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera/onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen/alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1 Step2 Step3 Step4
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Parallel Algorithm: each processor works on an (overlapping) trapezoid and communicates once with its neighbors
• Example: A tridiagonal, n = 32, k = 3
  (Figure: Proc 1 … Proc 4 each own a contiguous range of columns 1 … 32)
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
Same idea works for general sparse matrices:
  Simple block-row partitioning ⇒ (hyper)graph partitioning
  Left-to-right processing ⇒ Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax = b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i−1)             … SpMV
    MGS(w, v(0), …, v(i−1))    … Modified Gram–Schmidt
    update v(i), H
  end for
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)             … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W is from the power method, precision lost!
Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.
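A minimal sketch of the TSQR step used above: QR-factor row blocks of the tall-skinny W independently, then QR the stacked small R factors. With a binary reduction tree this costs O(log p) messages in parallel, versus one reduction per column for MGS. One tree level is shown; the block count p is illustrative.

```python
import numpy as np

def tsqr_r(W, p=4):
    # Local QRs on row blocks: embarrassingly parallel, no communication.
    blocks = np.array_split(W, p, axis=0)
    Rs = [np.linalg.qr(B)[1] for B in blocks]
    # Combine step: QR of the p stacked (small) R factors.
    return np.linalg.qr(np.vstack(Rs))[1]

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 4))   # tall and skinny
R = tsqr_r(W)
```

Since W = diag(Q_i) · vstack(R_i), the combined R satisfies RᵀR = WᵀW, i.e. it is a valid R factor of W (up to column signs).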
The "monomial" basis [Ax, …, Aᵏx] fails to converge
A different polynomial basis [p1(A)x, …, pk(A)x] does converge
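The instability and its fix can be seen numerically: the monomial basis is the power method, so its columns all tilt toward the dominant eigenvector and the basis becomes nearly rank deficient, while a Newton-type basis [x, (A − t₁I)x, (A − t₂I)(A − t₁I)x, …] with shifts spread over the spectrum stays far better conditioned. The matrix and shifts below are illustrative, not from the slides.

```python
import numpy as np

n, k = 50, 8
evals = np.linspace(1.0, 10.0, n)
A = np.diag(evals)          # diagonal test matrix with known spectrum
x = np.ones(n)

def shifted_basis(shifts):
    # Build [x, (A - t1 I)x, (A - t2 I)(A - t1 I)x, ...]
    V, v = [x], x
    for t in shifts:
        v = A @ v - t * v   # v <- (A - t I) v
        V.append(v)
    return np.column_stack(V)

mono = shifted_basis([0.0] * k)                    # monomial: all shifts zero
newton = shifted_basis(np.linspace(1.0, 10.0, k))  # shifts across the spectrum
```

Comparing `np.linalg.cond(mono)` and `np.linalg.cond(newton)` shows the Newton basis is orders of magnitude better conditioned, which is why it converges where the monomial basis does not.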
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires Co-tuning Kernels
CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye:

  Basis:             Naive    Monomial                         Newton       Chebyshev
  Replacement Its.:  74 (1)   [7,15,24,31,…,92,97,103] (17)    [67,98] (2)  68 (1)
Sample Application Speedups
• Geometric Multigrid (GMG) with CA Bottom Solver
  • Compared BICGSTAB vs. CA-BICGSTAB with s = 4
  • Hopper at NERSC (Cray XE6), weak scaling: up to 4096 MPI processes (24,576 cores total)
• Speedups for miniGMG benchmark (HPGMG benchmark predecessor)
  – 4.2x in bottom solve, 2.5x overall GMG solve
• Implemented as a solver option in BoxLib and CHOMBO AMR frameworks
  – 3D LMC (a low-Mach-number combustion code)
    • 2.5x in bottom solve, 1.5x overall GMG solve
  – 3D Nyx (an N-body and gas dynamics code)
    • 2x in bottom solve, 1.15x overall GMG solve
• Solve Horn–Schunck Optical Flow Equations
  • Compared CG vs. CA-CG with s = 3; 43% faster on NVIDIA GT640 GPU
Tuning space for Krylov Methods

                                   Indices
                                   Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero   Explicit (O(nnz))      CSR and variations    Vision, climate, AMR, …
  entries   Implicit (o(nnz))      Graph Laplacian       Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A²x, …, Aᵏx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just the last one
• Co-tuning and/or interleaving
  • W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • "Underlapping" instead of "overlapping" Domain Decomposition
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
• Related work
• Write-Avoiding Algorithms
• Reproducibility
Write-Avoiding Algorithms
• What if writes are more expensive than reads?
  – Nonvolatile Memory (Flash, PCM, …)
  – Saving intermediates to disk in the cloud (e.g. Spark)
  – Extra coherency traffic in shared memory
• Can we design "write-avoiding (WA)" algorithms?
  – Goal: find and attain better lower bound for writes
  – Thm: For classical matmul, possible to do asymptotically fewer writes than reads to a given layer of the memory hierarchy
  – Thm: Cache-oblivious algorithms cannot be write-avoiding
  – Thm: Strassen and FFT cannot be write-avoiding
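The matmul claim can be made concrete with a toy traffic counter: with block size b and a loop order that keeps the current C block resident in fast memory across the whole inner (k) loop, slow-memory writes total only n² words while reads grow like O(n³/b). This is a counting sketch under those assumptions, not an implementation from the talk.

```python
def blocked_matmul_traffic(n, b):
    # Count slow-memory words moved for blocked classical C = A·B,
    # assuming the C(i,j) block stays in fast memory across the k loop.
    reads = writes = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            reads += b * b            # load the C(i,j) block once
            for kk in range(0, n, b):
                reads += 2 * b * b    # load A(i,kk) and B(kk,j) blocks
            writes += b * b           # store the C(i,j) block once
    return reads, writes

reads, writes = blocked_matmul_traffic(n=1024, b=64)
```

Here writes = n² exactly, while reads = n² + 2n³/b, so the read/write ratio grows with n/b — asymptotically fewer writes than reads, as the theorem states for classical matmul.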
Measured L3–DRAM traffic on Intel Nehalem Xeon-7560
Optimal #DRAM reads = O(n³/M^(1/2)); optimal #DRAM writes = n²

Cache-Oblivious Matmul (L3: CO, L2: CO, L1: CO) – #DRAM reads close to optimal, #DRAM writes much larger:

  Blocking size          128    256    512    1K    2K    4K    8K    16K    32K
  L3_VICTIMS.M           1.9    2.1    1.8    2.3   4.8   9.8   19.5  39.6   78.5
  L3_VICTIMS.E           0.4    0.8    1.6    4.2   8.8   17.9  36.6  75.4   147.5
  LLC_S_FILLS.E          2.4    2.8    3.8    6.9   14    28.1  56.5  115.5  226.6
  Misses on Ideal Cache  2.512  3.024  4.048  6     12    24    48    96     192
  Write L.B.             2      2      2      2     2     2     2     2      2

Write-Avoiding Matmul (L3: 700, L2: MKL, L1: MKL) – total L3 misses close to optimal, total DRAM writes much larger:

  Blocking size          128    256    512    1K    2K    4K    8K    16K    32K
  L3_VICTIMS.M           2      2      1.9    2     2.3   2.7   3     3.6    4.4
  L3_VICTIMS.E           0.4    0.9    1.9    4.2   11.8  25.3  50.7  101.7  203.2
  LLC_S_FILLS.E          2.5    3      4.1    6.5   14.5  28.3  54    105.7  208.2
  Write L.B.             2      2      2      2     2     2     2     2      2
Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
  • bebop.cs.berkeley.edu/reproblas
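The nonassociativity is easy to demonstrate: summing the same three numbers in a different order yields a different double. A correctly rounded sum (Python's math.fsum here, standing in for the order-independent summation ReproBLAS provides with a scalable algorithm) gives the same answer regardless of order.

```python
import math

# Same three numbers, two association orders, two different doubles:
a = (0.1 + 0.2) + 0.3   # left to right
b = 0.1 + (0.2 + 0.3)   # right to left

# A correctly rounded sum is independent of the order of the data:
s1 = math.fsum([0.1, 0.2, 0.3])
s2 = math.fsum([0.3, 0.2, 0.1])
```

On a parallel machine the reduction tree (and hence the association order) can change from run to run, which is exactly why naive parallel sums are not reproducible.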
More possible class projects
• Could be one or more of
  – Extend lower bounds, new algorithms, compare existing algorithms, use algorithms to improve existing applications, performance analysis/measurement, compiler infrastructure, present existing results, …
• Extend to machine learning algorithms, others of the "13 motifs"
For more details
• bebop.cs.berkeley.edu
  – 155-page survey in Acta Numerica
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2016
    • www.cs.berkeley.edu/~demmel
    • All slides, video available
  – Prerecorded version broadcast since Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • Free autograding of homework
Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Jakub Kurzak, Dulceneia Becker, Ichitaro Yamazaki, …
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)