Introduction to Communication-Avoiding Algorithms · 1/25/16
TRANSCRIPT
1/25/16
1
Introduction to Communication-Avoiding Algorithms
www.cs.berkeley.edu/~demmel
Jim Demmel, and many, many others…
Why avoid communication? (1/2)
• Algorithms have two costs (measured in time or energy):
  1. Arithmetic (FLOPs)
  2. Communication: moving data between
     – levels of a memory hierarchy (sequential case)
     – processors over a network (parallel case)
[Figure: sequential case, CPU ↔ cache ↔ DRAM; parallel case, CPUs with local DRAM connected over a network]
Why avoid communication? (2/3)
• Running time of an algorithm is the sum of 3 terms:
  – #flops × time_per_flop
  – #words moved / bandwidth   } communication
  – #messages × latency        }
• time_per_flop << 1/bandwidth << latency
• Gaps growing exponentially with time [FOSC]:

  Annual improvements:
    time_per_flop: 59%
    Network: bandwidth 26%, latency 15%
    DRAM: bandwidth 23%, latency 5%

• Avoid communication to save time
• Goal: reorganize algorithms to avoid communication
  • Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
  • Very large speedups possible
  • Energy savings too!
Why Minimize Communication? (2/2)
[Figure: energy cost per operation in picojoules (log scale, 1–10000), now vs. projected 2018. Source: John Shalf, LBL]
• Minimize communication to save energy
Alternative Cost Model for Algorithms?
• Total distance moved by beads on an abacus
Goals
• Redesign algorithms to avoid communication
  • Between all memory hierarchy levels: L1 ↔ L2 ↔ DRAM ↔ network, etc.
• Attain lower bounds if possible
• Current algorithms often far from lower bounds
• Large speedups and energy savings possible
• Lots of open problems / potential class projects
Sample Speedups
• Up to 12x faster for 2.5D matmul on 64K core IBM BG/P
• Up to 3x faster for tensor contractions on 2K core Cray XE6
• Up to 6.2x faster for All-Pairs-Shortest-Path on 24K core Cray XE6
• Up to 2.1x faster for 2.5D LU on 64K core IBM BG/P
• Up to 11.8x faster for direct N-body on 32K core IBM BG/P
• Up to 13x faster for Tall Skinny QR on Tesla C2050 Fermi NVIDIA GPU
• Up to 6.7x faster for symeig(band A) on 10 core Intel Westmere
• Up to 2x faster for 2.5D Strassen on 38K core Cray XT4
• Up to 4.2x faster for MiniGMG benchmark bottom solver, using CA-BiCGStab (2.5x for overall solve)
  – 2.5x / 1.5x for combustion simulation code
President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress:

“New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.” FY 2010 Congressional Budget, Volume 4, FY 2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.

CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); “Tall-Skinny” QR (Grigori, Hoemmen, Langou, JD)
Summary of CA Algorithms
• “Direct” Linear Algebra
  • Lower bounds on communication for linear algebra problems like Ax=b, least squares, Ax=λx, SVD, etc.
  • New algorithms that attain these lower bounds
    • Being added to libraries: Sca/LAPACK, PLASMA, MAGMA
  • Large speed-ups possible
  • Autotuning to find optimal implementation
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
Outline
• “Direct” Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
• Related work
Lower bound for all “direct” linear algebra
• Let M = “fast” memory size (per processor). Then:

  #words_moved (per processor) = Ω(#flops (per processor) / M^{1/2})
  #messages_sent (per processor) = Ω(#flops (per processor) / M^{3/2})

  (The second bound follows from the first: #messages_sent ≥ #words_moved / largest_message_size, and the largest message fits in fast memory, so its size is at most M.)
• Parallel case: assume either load or memory balanced
• Holds for
  – Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, …
  – Some whole programs (sequences of these operations, no matter how individual ops are interleaved, e.g. A^k)
  – Dense and sparse matrices (where #flops << n^3)
  – Sequential and parallel algorithms
  – Some graph-theoretic algorithms (e.g. Floyd-Warshall)
SIAM SIAG/Linear Algebra Prize, 2012: Ballard, Demmel, Holtz, Schwartz
Can we attain these lower bounds?
• Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  – Often not
• If not, are there other algorithms that do?
  – Yes, for much of dense linear algebra
  – New algorithms, with new numerical properties, new ways to encode answers, new data structures
  – Not just loop transformations
• Only a few sparse algorithms so far
• Lots of work in progress / possible projects
• Case study: Matrix Multiply
Naïve Matrix Multiply {implements C = C + A*B}
  for i = 1 to n
    for j = 1 to n
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
[Figure: C(i,j) updated by the product of row A(i,:) and column B(:,j)]
Naïve Matrix Multiply {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}
    for j = 1 to n
      {read C(i,j) into fast memory}
      {read column j of B into fast memory}
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}
Naïve Matrix Multiply {implements C = C + A*B}
  for i = 1 to n
    {read row i of A into fast memory}          … n^2 reads altogether
    for j = 1 to n
      {read C(i,j) into fast memory}            … n^2 reads altogether
      {read column j of B into fast memory}     … n^3 reads altogether
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}        … n^2 writes altogether

n^3 + 3n^2 reads/writes altogether – dominates 2n^3 arithmetic
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.
  for i = 1 to n/b
    for j = 1 to n/b
      {read block C(i,j) into fast memory}
      for k = 1 to n/b
        {read block A(i,k) into fast memory}
        {read block B(k,j) into fast memory}
        C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}
Blocked (Tiled) Matrix Multiply
Consider A, B, C to be n/b-by-n/b matrices of b-by-b subblocks, where b is called the block size; assume 3 b-by-b blocks fit in fast memory.
  for i = 1 to n/b
    for j = 1 to n/b
      {read block C(i,j) into fast memory}      … b^2 × (n/b)^2 = n^2 reads
      for k = 1 to n/b
        {read block A(i,k) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
        {read block B(k,j) into fast memory}    … b^2 × (n/b)^3 = n^3/b reads
        C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
      {write block C(i,j) back to slow memory}  … b^2 × (n/b)^2 = n^2 writes

2n^3/b + 2n^2 reads/writes << 2n^3 arithmetic – Faster!
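The tiled loop nest above can be sketched in a few lines of Python. This is an illustrative numpy stand-in (the function name and signature are ours, not from any library); the three b-by-b blocks touched inside the k loop play the role of the blocks held in fast memory.

```python
import numpy as np

def blocked_matmul(A, B, C, b):
    """Blocked (tiled) C = C + A*B; n is assumed to be a multiple of b."""
    n = A.shape[0]
    for i in range(0, n, b):
        for j in range(0, n, b):
            Cij = C[i:i+b, j:j+b]              # {read block C(i,j)} - a view into C
            for k in range(0, n, b):
                # {read blocks A(i,k), B(k,j)}; do a b x b matmul on blocks
                Cij += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C                                   # C(i,j) written back via the views
```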
Does blocked matmul attain lower bound?
• Recall: if 3 b-by-b blocks fit in fast memory of size M, then #reads/writes = 2n^3/b + 2n^2
• Make b as large as possible: 3b^2 ≤ M, so #reads/writes ≥ 2n^3/(M/3)^{1/2} + 2n^2
• Attains lower bound = Ω(#flops / M^{1/2})
• But what if we don’t know M?
• Or if there are multiple levels of fast memory?
• How do we write the algorithm?
How hard is hand-tuning matmul, anyway?
• Results of 22 student teams trying to tune matrix-multiply, in CS267 Spr09
• Students given “blocked” code to start with (7x faster than naïve)
• Still hard to get close to vendor-tuned performance (ACML) (another 6x)
• For more discussion, see www.cs.berkeley.edu/~volkov/cs267.sp09/hw1/results/
Recursive Matrix Multiplication (RMM) (1/2)
• For simplicity: square matrices with n = 2^m
• C = A·B, viewed as 2x2 block matrices:

  [C11 C12]   [A11 A12]   [B11 B12]   [A11·B11 + A12·B21   A11·B12 + A12·B22]
  [C21 C22] = [A21 A22] · [B21 B22] = [A21·B11 + A22·B21   A21·B12 + A22·B22]

• True when each Aij etc. is 1x1 or n/2 x n/2

  func C = RMM(A, B, n)
    if n = 1, C = A * B, else {
      C11 = RMM(A11, B11, n/2) + RMM(A12, B21, n/2)
      C12 = RMM(A11, B12, n/2) + RMM(A12, B22, n/2)
      C21 = RMM(A21, B11, n/2) + RMM(A22, B21, n/2)
      C22 = RMM(A21, B12, n/2) + RMM(A22, B22, n/2) }
    return
Recursive Matrix Multiplication (RMM) (2/2)

  A(n) = # arithmetic operations in RMM(·,·,n)
       = 8·A(n/2) + 4(n/2)^2 if n > 1, else 1
       = 2n^3 … same operations as usual, in different order
  W(n) = # words moved between fast, slow memory by RMM(·,·,n)
       = 8·W(n/2) + 12(n/2)^2 if 3n^2 > M, else 3n^2
       = O(n^3 / M^{1/2} + n^2) … same as blocked matmul

“Cache oblivious”: works for memory hierarchies, but not a panacea.

For big speedups, see SC12 poster on “Beating MKL and ScaLAPACK at Rectangular Matmul”, 11/13 at 5:15-7pm
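RMM translates directly into Python; this is an illustrative sketch (our own function name), notable for containing no block size at all, which is what makes it cache-oblivious:

```python
import numpy as np

def rmm(A, B):
    """Cache-oblivious recursive matmul (RMM); n assumed a power of 2.
    No tuning parameter b appears: the recursion fits each level of the
    memory hierarchy automatically once subproblems are small enough."""
    n = A.shape[0]
    if n == 1:
        return A * B                            # 1x1 base case
    h = n // 2
    C = np.empty((n, n))
    C[:h, :h] = rmm(A[:h, :h], B[:h, :h]) + rmm(A[:h, h:], B[h:, :h])
    C[:h, h:] = rmm(A[:h, :h], B[:h, h:]) + rmm(A[:h, h:], B[h:, h:])
    C[h:, :h] = rmm(A[h:, :h], B[:h, :h]) + rmm(A[h:, h:], B[h:, :h])
    C[h:, h:] = rmm(A[h:, :h], B[:h, h:]) + rmm(A[h:, h:], B[h:, h:])
    return C
```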
CARMA Performance: Shared Memory
[Figure: square case (m = k = n), performance (linear scale) vs. problem size (log scale): MKL vs. CARMA, single and double precision, with single- and double-precision peak lines.]
Intel Emerald: 4 Intel Xeon X7560 × 8 cores, 4 × NUMA
CARMA Performance: Shared Memory
[Figure: inner-product-shaped case (m = n = 64), performance (linear scale) vs. problem size (log scale): MKL vs. CARMA, single and double precision.]
Intel Emerald: 4 Intel Xeon X7560 × 8 cores, 4 × NUMA
Why is CARMA Faster? L3 Cache Misses
[Figure: shared memory inner product (m = n = 64; k = 524,288): CARMA has 97% fewer misses and 86% fewer misses than MKL in the two precisions.]
Parallel MatMul with 2D Processor Layout
• P processors in a P^{1/2} x P^{1/2} grid
  – Processors communicate along rows, columns
• Each processor owns an n/P^{1/2} x n/P^{1/2} submatrix of A, B, C
• Example: P = 16, processors numbered from P00 to P33
  – Processor Pij owns submatrices Aij, Bij and Cij
[Figure: three 4x4 grids of processors P00 … P33, one each for C, A, and B, illustrating C = A * B]
SUMMA Algorithm
• SUMMA = Scalable Universal Matrix Multiply
  – Attains lower bounds:
    • Assume fast memory size M = O(n^2/P) per processor – 1 copy of data
    • #words_moved = Ω(#flops / M^{1/2}) = Ω((n^3/P) / (n^2/P)^{1/2}) = Ω(n^2 / P^{1/2})
    • #messages = Ω(#flops / M^{3/2}) = Ω((n^3/P) / (n^2/P)^{3/2}) = Ω(P^{1/2})
  – Can accommodate any processor grid, matrix dimensions & layout
  – Used in practice in PBLAS = Parallel BLAS
    • www.netlib.org/lapack/lawns/lawn{96,100}.ps
• Comparison to Cannon’s Algorithm
  – Cannon attains lower bound
  – But Cannon is harder to generalize to other grids, dimensions, layouts, and Cannon may use more memory
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid
• C(i,j) is the n/P^{1/2} x n/P^{1/2} submatrix of C on processor Pij
• A(i,k) is an n/P^{1/2} x b submatrix of A
• B(k,j) is a b x n/P^{1/2} submatrix of B
• C(i,j) = C(i,j) + Σk A(i,k)*B(k,j)
  • summation over submatrices
  • Need not be a square processor grid
[Figure: block-column panel A(i,k) and block-row panel B(k,j) updating C(i,j)]
SUMMA – n x n matmul on P^{1/2} x P^{1/2} grid

  for k = 0 to n/b - 1   … where b is the block size = # cols in A(i,k) and # rows in B(k,j)
    for all i = 1 to P^{1/2}   … in parallel
      owner of A(i,k) broadcasts it to whole processor row (using binary tree)
    for all j = 1 to P^{1/2}   … in parallel
      owner of B(k,j) broadcasts it to whole processor column (using binary tree)
    Receive A(i,k) into Acol
    Receive B(k,j) into Brow
    C_myproc = C_myproc + Acol * Brow

• Attains bandwidth lower bound
• Attains latency lower bound if b near maximum n/P^{1/2}
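The communication pattern above can be emulated serially, which makes the algebra easy to check. This is a sketch under stated assumptions (our own function `summa`; a square p x p grid with n divisible by p and b; the "broadcasts" become shared slices):

```python
import numpy as np

def summa(A, B, p, b):
    """Serial emulation of SUMMA on a p x p processor grid with panel width b.
    Each "processor" (i,j) owns an (n/p) x (n/p) block of C; each outer step
    broadcasts a column panel of A along processor rows and a row panel of B
    along processor columns, then every processor does a local rank-b update."""
    n = A.shape[0]
    s = n // p                                  # local block dimension
    C = np.zeros((n, n))
    for k in range(0, n, b):
        Acol = A[:, k:k+b]                      # panel broadcast along proc rows
        Brow = B[k:k+b, :]                      # panel broadcast along proc cols
        for i in range(p):                      # every processor updates its own block
            for j in range(p):
                C[i*s:(i+1)*s, j*s:(j+1)*s] += (
                    Acol[i*s:(i+1)*s, :] @ Brow[:, j*s:(j+1)*s])
    return C
```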
Summary of dense parallel algorithms attaining communication lower bounds
• Assume n x n matrices on P processors
• Minimum memory per processor: M = O(n^2/P)
• Recall lower bounds:
  #words_moved = Ω((n^3/P) / M^{1/2}) = Ω(n^2 / P^{1/2})
  #messages    = Ω((n^3/P) / M^{3/2}) = Ω(P^{1/2})
• Does ScaLAPACK attain these bounds?
  • For #words_moved: mostly, except nonsym. eigenproblem
  • For #messages: asymptotically worse, except Cholesky
• New algorithms attain all bounds, up to polylog(P) factors
  • Cholesky, LU, QR, sym. and nonsym. eigenproblems, SVD
  • Needed to replace partial pivoting in LU
  • Need randomization for nonsym. eigenproblem (so far)

Can we do better?
Can we do better?
• Aren’t we already optimal?
• Why assume M = O(n^2/P), i.e. minimal?
  – Lower bounds still true if more memory
  – Can we attain them?
  – Special case: “3D Matmul”: uses M = O(n^2/P^{2/3})
    • Dekel, Nassimi, Sahni [81], Bernsten [89], Agarwal, Chandra, Snir [90], Johnson [93], Agarwal, Balle, Gustavson, Joshi, Palkar [95]
    • Processors arranged in P^{1/3} x P^{1/3} x P^{1/3} grid
    – M = O(n^2/P^{2/3}) is P^{1/3} times the minimum
• Not always that much memory available…
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid
[Figure: (P/c)^{1/2} x (P/c)^{1/2} x c processor grid; example: P = 32, c = 2]
2.5D Matrix Multiplication
• Assume we can fit c·n^2/P data per processor, c > 1
• Processors form a (P/c)^{1/2} x (P/c)^{1/2} x c grid
• Initially P(i,j,0) owns A(i,j) and B(i,j), each of size n(c/P)^{1/2} x n(c/P)^{1/2}
(1) P(i,j,0) broadcasts A(i,j) and B(i,j) to P(i,j,k)
(2) Processors at level k perform 1/c-th of SUMMA, i.e. 1/c-th of Σm A(i,m)*B(m,j)
(3) Sum-reduce partial sums Σm A(i,m)*B(m,j) along the k-axis so that P(i,j,0) owns C(i,j)
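Steps (1)-(3) can be emulated serially to see the arithmetic: split the inner k-sum into c slices (one per processor layer), form each layer's partial product from its replicated copy of A and B, then sum-reduce. A minimal sketch (the function name `matmul_25d` is ours; n assumed divisible by c):

```python
import numpy as np

def matmul_25d(A, B, c):
    """Serial emulation of the 2.5D schedule: layer l of the processor grid
    computes the partial sum over its 1/c-th of the inner dimension, and a
    final sum-reduce along the grid's k-axis combines the c partial products."""
    n = A.shape[0]
    w = n // c
    partials = [A[:, l*w:(l+1)*w] @ B[l*w:(l+1)*w, :]   # layer l's share of sum_m
                for l in range(c)]
    return sum(partials)                                # sum-reduce along k-axis
```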
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: percentage of machine peak for matrix multiplication on 16,384 nodes of BG/P, n = 8192 and n = 131072, 2D MM vs. 2.5D MM using c = 16 matrix copies; 2.5D is up to 12X and 2.7X faster.]
2.5D Matmul on BG/P, 16K nodes / 64K cores
[Figure: execution time normalized by 2D, broken into computation / idle / communication, for n = 8192 and n = 131072, 2D vs. 2.5D with c = 16 copies: up to 95% reduction in communication; 12x and 2.7x faster overall.]
Distinguished Paper Award, EuroPar’11; SC’11 paper by Solomonik, Bhatele, Demmel.
Perfect Strong Scaling – in Time and Energy (1/2)
• Every time you add a processor, you should use its memory M too
• Start with minimal number of procs: P·M = 3n^2
• Increase P by a factor of c ⇒ total memory increases by a factor of c
• Notation for timing model:
  – γT, βT, αT = secs per flop, per word_moved, per message of size m
  – T(cP) = n^3/(cP) · [γT + βT/M^{1/2} + αT/(m·M^{1/2})] = T(P)/c
• Notation for energy model:
  – γE, βE, αE = joules for same operations
  – δE = joules per word of memory used per sec
  – εE = joules per sec for leakage, etc.
  – E(cP) = cP · { n^3/(cP) · [γE + βE/M^{1/2} + αE/(m·M^{1/2})] + δE·M·T(cP) + εE·T(cP) } = E(P)
• Limit: c ≤ P^{1/3} (3D algorithm), if starting with 1 copy of inputs
Perfect Strong Scaling – in Time and Energy (2/2)
• Perfect scaling extends to N-body, Strassen, …
• Can prove lower bounds on network (e.g. 3D torus for matmul)
• We can use these models to answer many questions, including:
  • What is the minimum energy required for a computation?
  • Given a maximum allowed runtime T, what is the minimum energy E needed to achieve it?
  • Given a maximum energy budget E, what is the minimum runtime T that we can attain?
  • The ratio P = E/T gives us the average power required to run the algorithm. Can we minimize the average power consumed?
  • Given an algorithm, problem size, number of processors and target energy efficiency (GFLOPS/W), can we determine a set of architectural parameters to describe a conforming computer architecture?
• See Andrew Gearhart’s PhD thesis
Handling Heterogeneity
• Suppose each of P processors could differ
  – γi = sec/flop, βi = sec/word, αi = sec/message, Mi = memory
• What is the optimal assignment of work Fi to minimize time?
  – Ti = Fi·γi + Fi·βi/Mi^{1/2} + Fi·αi/Mi^{3/2} = Fi·[γi + βi/Mi^{1/2} + αi/Mi^{3/2}] = Fi·ξi
  – Choose Fi so that Σi Fi = n^3, minimizing T = maxi Ti
  – Answer: Fi = n^3·(1/ξi)/Σj(1/ξj) and T = n^3/Σj(1/ξj)
• Optimal algorithm for n x n matmul
  – Recursively divide into 8 half-sized subproblems
  – Assign subproblems to processor i to add up to Fi flops
• Works for Strassen, other algorithms…
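The closed-form answer above (Fi proportional to 1/ξi, so every processor finishes at the same time T) is two lines of code. An illustrative sketch (function name ours):

```python
def optimal_split(xi, total_flops):
    """Work assignment minimizing T = max_i F_i * xi_i subject to sum F_i = total:
    make F_i proportional to 1/xi_i, so all processors finish simultaneously."""
    inv = [1.0 / x for x in xi]
    s = sum(inv)
    F = [total_flops * v / s for v in inv]   # F_i = total * (1/xi_i) / sum_j (1/xi_j)
    T = total_flops / s                      # common finish time
    return F, T
```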
Application to Tensor Contractions
• Ex: C(i,j,k) = Σmn A(i,j,m,n) * B(m,n,k)
  – Communication lower bounds apply
• Complex symmetries possible
  – Ex: B(m,n,k) = B(k,m,n) = …
  – d-fold symmetry can save up to d!-fold flops/memory
• Heavily used in electronic structure calculations
  – Ex: NWChem
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Solomonik, Hammond, Matthews
[Figure: C(i,j,k) = Σm A(i,j,m) * B(m,k), with A 3-fold symmetric, B 2-fold symmetric, C 2-fold symmetric]
Application to Tensor Contractions (continued)
• Heavily used in electronic structure calculations
  – Ex: NWChem, for the coupled cluster (CC) approach to the Schroedinger eqn.
• CTF: Cyclops Tensor Framework
  – Exploits 2.5D algorithms, symmetries
  – Up to 3x faster running CC than NWChem on 3072 cores of Cray XE6
  – Solomonik, Hammond, Matthews
TSQR: QR of a Tall, Skinny matrix

  W = [W0; W1; W2; W3] = [Q00·R00; Q10·R10; Q20·R20; Q30·R30]
                       = diag(Q00, Q10, Q20, Q30) · [R00; R10; R20; R30]

  [R00; R10; R20; R30] = [Q01·R01; Q11·R11] = diag(Q01, Q11) · [R01; R11]

  [R01; R11] = Q02·R02
Output of TSQR: {Q00, Q10, Q20, Q30, Q01, Q11, Q02, R02}
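The factorization tree above is easy to sketch with numpy. This is a two-level illustration (names `tsqr`, `nblocks` are ours) that keeps only the R factors and returns the final R; in the real algorithm the implicit Q tree {Q00, …, Q02} is retained too:

```python
import numpy as np

def tsqr(W, nblocks=4):
    """Two-level TSQR sketch: local QR of each row block (done in parallel in
    the real algorithm), then QR of the stacked local R factors. Returns the
    final R; the Q factors form the reduction tree and are discarded here."""
    blocks = np.array_split(W, nblocks, axis=0)
    Rs = [np.linalg.qr(Wb)[1] for Wb in blocks]   # leaf QRs: W_i = Q_i0 * R_i0
    R = np.vstack(Rs)
    while R.shape[0] > W.shape[1]:                # reduce stacked R's up the tree
        R = np.linalg.qr(R)[1]
    return R
```

The check below uses the identity W^T W = R^T R, which holds for any QR factorization of W regardless of sign conventions in R.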
TSQR: An Architecture-Dependent Algorithm

  W = [W0; W1; W2; W3]

  Parallel (binary tree):  (R00, R10, R20, R30) → (R01, R11) → R02
  Sequential (flat tree):  R00 → R01 → R02 → R03
  Dual core: a hybrid of the two trees

Can choose reduction tree dynamically. Multicore / multisocket / multirack / multisite / out-of-core: ?
TSQR Performance Results
• Parallel speedups
  – Up to 8x on 8-core Intel Clovertown
  – Up to 6.7x on 16-processor Pentium cluster
  – Up to 4x on 32-processor IBM BlueGene
  – Up to 13x on NVidia GPU
  – Up to 4x on 4 cities vs 1 city (Dongarra, Langou et al)
  – Only 1.6x slower on Cloud than just accessing data twice (Gleich and Benson)
• Sequential speedup
  – “Infinite” for out-of-core on PowerPC laptop
• SVD costs about the same
• Joint work with Grigori, Hoemmen, Langou, Anderson, Ballard, Keutzer, others
Data from Grey Ballard, Mark Hoemmen, Laura Grigori, Julien Langou, Jack Dongarra, Michael Anderson
Using similar idea for TSLU as TSQR: use reduction tree to do “Tournament Pivoting”

  W (n x b) = [W1; W2; W3; W4] = [P1·L1·U1; P2·L2·U2; P3·L3·U3; P4·L4·U4]
    Choose b pivot rows of W1, call them W1’; ditto for W2’, W3’, W4’
  [W1’; W2’] = P12·L12·U12;  [W3’; W4’] = P34·L34·U34
    Choose b pivot rows of each, call them W12’ and W34’
  [W12’; W34’] = P1234·L1234·U1234
    Choose b pivot rows
• Go back to W and use these b pivot rows
  • Move them to top, do LU without pivoting
  • Extra work, but lower order term
• Thm: As numerically stable as partial pivoting on a larger matrix
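A minimal executable sketch of tournament pivoting (function names ours; a flat second round stands in for the binary tree, which is enough to show the candidate-then-final structure):

```python
import numpy as np

def gepp_pivots(W, b):
    """Original indices of the b pivot rows chosen by LU with partial pivoting."""
    A = W.astype(float).copy()
    rows = list(range(A.shape[0]))
    piv = []
    for k in range(b):
        p = k + int(np.argmax(np.abs(A[k:, k])))   # largest entry in column k
        A[[k, p]] = A[[p, k]]                      # swap pivot row to the top
        rows[k], rows[p] = rows[p], rows[k]
        piv.append(rows[k])
        A[k+1:, k] /= A[k, k]                      # eliminate below the pivot
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return piv

def tournament_pivots(W, b, nblocks=4):
    """Each block nominates b candidate pivot rows via local partial pivoting;
    one more round on the stacked candidates picks the final b pivot rows."""
    idx_blocks = np.array_split(np.arange(W.shape[0]), nblocks)
    candidates = []
    for idx in idx_blocks:
        candidates.extend(idx[gepp_pivots(W[idx], b)])   # W_i -> W_i'
    candidates = np.array(candidates)
    final = gepp_pivots(W[candidates], b)                # playoff round
    return candidates[final].tolist()
```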
LU Speedups from Tournament Pivoting and 2.5D
[Figure: percentage of machine peak for 2.5D LU with CA-pivoting on BG/P (n = 65,536), vs. 2D LU (CA-pvt) and ScaLAPACK PDGETRF, on 256 / 512 / 1024 / 2048 nodes.]
2.5D vs 2D LU With and Without Pivoting
[Figure: time (sec) for LU on 16,384 nodes of BG/P (n = 131,072), broken into compute / idle / communication, for NO-pivot 2D, NO-pivot 2.5D, CA-pivot 2D, CA-pivot 2.5D; 2.5D is 2X faster both with and without pivoting.]
Thm: Perfect strong scaling impossible, because latency × bandwidth = Ω(n^2)
Other CA algorithms
• Need for pivoting arises beyond LU, in QR
  – Choose permutation P so that leading columns of A*P = Q*R span the column space of A – Rank Revealing QR (RRQR)
  – Usual approach like partial pivoting
    • Put longest column first, update rest of matrix, repeat
    • Hard to do using BLAS3 at all, let alone hit lower bound
  – Use Tournament Pivoting
    • Each round of tournament selects best b columns from two groups of b columns, either using usual approach or something better (Gu/Eisenstat)
    • Thm: This approach “reveals the rank” of A in the sense that the leading r x r submatrix of R has singular values “near” the largest r singular values of A; ditto for trailing submatrix
  – Idea extends to other pivoting schemes
    • Cholesky with diagonal pivoting
    • LU with complete pivoting
    • LDLT with complete pivoting
Communication Lower Bounds for Strassen-like matmul algorithms

  Classical O(n^3) matmul:        #words_moved = Ω(M·(n/M^{1/2})^3 / P)
  Strassen’s O(n^{lg 7}) matmul:  #words_moved = Ω(M·(n/M^{1/2})^{lg 7} / P)
  Strassen-like O(n^ω) matmul:    #words_moved = Ω(M·(n/M^{1/2})^ω / P)

• Proof: graph expansion (different from classical matmul)
  – Strassen-like: DAG must be “regular” and connected
• Extends up to M = n^2 / p^{2/ω}
• Best Paper Prize (SPAA’11), Ballard, D., Holtz, Schwartz; appeared in JACM
• Is the lower bound attainable?
Communication Avoiding Parallel Strassen (CAPS)

  CAPS:
    if enough memory and P ≥ 7
      then BFS step   … run all 7 multiplies in parallel, each on P/7 processors; needs 7/4 as much memory
      else DFS step   … run all 7 multiplies sequentially, each on all P processors; needs 1/4 as much memory

In practice, how best to interleave BFS and DFS steps is a “tuning parameter”
Performance Benchmarking, Strong Scaling Plot
[Figure: Franklin (Cray XT4), n = 94080.]
Speedups: 24%-184% (over previous Strassen-based algorithms)
Invited to appear as Research Highlight in CACM
What about fast algorithms for the rest of linear algebra?
• “Fast Matrix Multiplication is Stable” – JD, I. Dumitriu, O. Holtz, R. Kleinberg (2007)
• “Fast Linear Algebra is Stable” – JD, I. Dumitriu, O. Holtz (2007)
• “Sequential Communication Bounds for Fast Linear Algebra” – G. Ballard, JD, O. Holtz, O. Schwartz (EECS-2012-36)
• Parallel communication bounds?
• Implementations?
Symmetric Band Reduction
• Grey Ballard and Nick Knight
• A ⇒ Q·A·Q^T = T, where
  – A = A^T is banded
  – T is tridiagonal
  – Similar idea for SVD of a band matrix
• Use alone, or as second phase when A is dense:
  – Dense ⇒ Banded ⇒ Tridiagonal
• Implemented in LAPACK’s sytrd
• Algorithm does not satisfy communication lower bound theorem for applying orthogonal transformations
  – It can communicate even less!
Conventional vs CA-SBR
• Conventional: touches all data 4 times
• Communication-avoiding: touches all data once
• Many tuning parameters: number of “sweeps”, #diagonals cleared per sweep, sizes of parallelograms, #bulges chased at one time, how far to chase each bulge
• Right choices reduce #words_moved by a factor of M/bw, not just M^{1/2}
Speedups of Sym. Band Reduction vs LAPACK’s DSBTRD
• Up to 17x on Intel Gainestown, vs MKL 10.0 (n=12000, b=500, 8 threads)
• Up to 12x on Intel Westmere, vs MKL 10.3 (n=12000, b=200, 10 threads)
• Up to 25x on AMD Budapest, vs ACML 4.4 (n=9000, b=500, 4 threads)
• Up to 30x on AMD Magny-Cours, vs ACML 4.4 (n=12000, b=500, 6 threads)
• Neither MKL nor ACML benefits from multithreading in DSBTRD
  – Best sequential speedup vs MKL: 1.9x
  – Best sequential speedup vs ACML: 8.5x
What about sparse matrices? (1/3)
• If matrix quickly becomes dense, use dense algorithm
• Ex: All-Pairs Shortest Path using Floyd-Warshall
• Similar to matmul: let D = A, then
    for k = 1:n, for i = 1:n, for j = 1:n
      D(i,j) = min(D(i,j), D(i,k) + D(k,j))
• But can’t reorder outer loop for 2.5D, need another idea
• Abbreviate D(i,j) = min(D(i,j), min_k(A(i,k)+B(k,j))) by D = A⊗B
  – Dependencies ok, 2.5D works, just a different semiring
• Kleene’s Algorithm:
    D = DC-APSP(A, n)
      D = A; partition D = [[D11, D12]; [D21, D22]] into n/2 x n/2 blocks
      D11 = DC-APSP(D11, n/2)
      D12 = D11 ⊗ D12;  D21 = D21 ⊗ D11;  D22 = D21 ⊗ D12
      D22 = DC-APSP(D22, n/2)
      D21 = D22 ⊗ D21;  D12 = D12 ⊗ D22;  D11 = D12 ⊗ D21
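The point that APSP is "matmul over a different semiring" can be made executable. A minimal dense sketch (function names ours): `floyd_warshall` is the classical triple loop, and `minplus` is the D = A⊗B operator, i.e. matmul with (+, ×) replaced by (min, +) and the result min'd into the existing D:

```python
import numpy as np

INF = float("inf")

def minplus(D, A, B):
    """D(i,j) = min(D(i,j), min_k A(i,k) + B(k,j)): the semiring operator A (x) B."""
    n = D.shape[0]
    for i in range(n):
        for j in range(n):
            D[i, j] = min(D[i, j], min(A[i, k] + B[k, j] for k in range(n)))
    return D

def floyd_warshall(A):
    """Classical APSP: same triple loop as matmul, different semiring."""
    D = A.astype(float).copy()
    n = D.shape[0]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                D[i, j] = min(D[i, j], D[i, k] + D[k, j])
    return D
```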
Performance of 2.5D APSP using Kleene
[Figure: strong scaling on Hopper (Cray XE6 with 1024 nodes = 24,576 cores): GFlops vs. number of compute nodes (1–1024) for n = 4096 and n = 8192, with replication c = 1, 4, 16; up to 6.2x and 2x speedups.]
What about sparse matrices? (2/3)
• If parts of matrix become dense, optimize those
• Ex: Cholesky on matrix A with good separators
• Thm (Lipton, Rose, Tarjan, ’79): If all balanced separators of G(A) have at least w vertices, then G(chol(A)) has a clique of size w
  – Need to do dense Cholesky on w x w submatrix
• Thm: #words_moved = Ω(w^3/M^{1/2}) etc.
• Thm (George, ’73): Nested dissection gives optimal ordering for 2D grid, 3D grid, similar matrices
  – w = n for 2D n x n grid, w = n^2 for 3D n x n x n grid
• Sequential multifrontal Cholesky attains bounds
• PSPACES (Gupta, Karypis, Kumar) is a parallel sparse multifrontal Cholesky package
  – Attains 2D and 2.5D lower bounds (using optimal dense Cholesky on separators)
What about sparse matrices? (3/3)
• If matrix stays very sparse, lower bound unattainable; new one?
• Ex: A*B, both diagonal: no communication in parallel case
• Ex: A*B, both Erdos-Renyi: Prob(A(i,j) ≠ 0) = d/n, d << n^{1/2}, iid
• Assumption: algorithm is sparsity-independent: assignment of data and work to processors is sparsity-pattern-independent (but zero entries need not be communicated or operated on)
• Thm: A parallel algorithm that is sparsity-independent and load balanced for Erdos-Renyi matmul satisfies (in expectation)
    #words_moved = Ω(min(d·n/P^{1/2}, d^2·n/P))
  – Proof exploits fact that reuse of entries of C = A*B is unlikely
• Contrast general lower bound: #words_moved = Ω(d^2·n/(P·M^{1/2}))
• Attained by divide-and-conquer algorithm that splits matrices along the dimensions most likely to minimize cost
• Recent result (P. Koanantakool et al, IPDPS’16): Dense*Sparse
Summary of Direct Linear Algebra (1/2)
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work / possible projects on
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, low rank factorizations, eigenproblems…
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into Sca/LAPACK, PLASMA, MAGMA, …
• Integration of CTF into quantum chemistry / DFT applications
  – Aquarius, with ANL, UT Austin, on IBM BG/Q, Cray XC30
  – Qbox, with LLNL, IBM, on IBM BG/Q
  – Q-Chem, work in progress
• Integration into big data analysis system based on Spark at AMPLab
Summary of Direct Linear Algebra (2/2)
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of ongoing work / possible projects on
  – Algorithms:
    • LDLT, QR with pivoting, other pivoting schemes, low rank factorizations, eigenproblems…
    • Compare fast QR (…
    • All-pairs-shortest-path, …
    • Both 2D (c=1) and 2.5D (c>1)
    • But only bandwidth may decrease with c>1, not latency
    • Sparse matrices
  – Platforms:
    • Multicore, cluster, GPU, cloud, heterogeneous, low-energy, …
  – Software:
    • Integration into Sca/LAPACK, PLASMA, MAGMA, …
Outline
• “Direct” Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
• Related work
Recall optimal sequential Matmul
• Naïve code: for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• “Blocked” code: for i=1:n/b, for j=1:n/b, for k=1:n/b, C[i,j] += A[i,k]*B[k,j]   … b x b matmul
• Thm: Picking b = M^{1/2} attains lower bound: #words_moved = Ω(n^3/M^{1/2})
• Where does 1/2 come from?
New Thm applied to Matmul
• for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
• Record array indices in matrix Δ:

            i  j  k
  Δ =     [ 1  0  1 ]   A
          [ 0  1  1 ]   B
          [ 1  1  0 ]   C

• Solve LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = e
• Thm: #words_moved = Ω(n^3/M^{e-1}) = Ω(n^3/M^{1/2})
  Attained by block sizes M^{xi}, M^{xj}, M^{xk} = M^{1/2}, M^{1/2}, M^{1/2}
New Thm applied to Direct N-Body
• for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
• Record array indices in matrix Δ:

            i  j
  Δ =     [ 1  0 ]   F
          [ 1  0 ]   P(i)
          [ 0  1 ]   P(j)

• Solve LP for x = [xi, xj]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [1, 1], 1^T x = 2 = e
• Thm: #words_moved = Ω(n^2/M^{e-1}) = Ω(n^2/M^1)
  Attained by block sizes M^{xi}, M^{xj} = M^1, M^1
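The M x M blocking the theorem prescribes is just a tiling of the (i, j) loop nest so each tile reuses one tile of particles from fast memory. An illustrative sketch (function name ours; the pairwise "force" is a toy displacement sum, not a physical model, so the result is easy to check in closed form):

```python
import numpy as np

def blocked_forces(P, b):
    """Blocked direct n-body over 1D positions P: process b x b tiles of (i, j)
    pairs, so the b target and b source particles of each tile stand in for the
    M^1 x M^1 blocks from the LP. Toy force: F(i) += P(j) - P(i)."""
    n = len(P)
    F = np.zeros_like(P)
    for i0 in range(0, n, b):
        Pi = P[i0:i0+b]                              # tile of targets in "fast memory"
        for j0 in range(0, n, b):
            Pj = P[j0:j0+b]                          # tile of sources
            # accumulate all b*b pairwise interactions within the tile
            F[i0:i0+b] += (Pj[None, :] - Pi[:, None]).sum(axis=1)
    return F
```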
N-Body Speedups on IBM BG/P (Intrepid), 8K cores, 32K particles
[Figure: 11.8x speedup.]
K. Yelick, E. Georganas, M. Driscoll, P. Koanantakool, E. Solomonik

Some Applications
• Gravity, turbulence, molecular dynamics, plasma simulation, …
• Electron-beam lithography device simulation
• Hair…
  – www.fxguide.com/featured/brave-new-hair/
  – graphics.pixar.com/library/CurlyHairA/paper.pdf
New Thm applied to Random Code
• for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6) += func2(A6(i1,i4,i5), A3(i3,i4,i6))
• Record array indices in matrix Δ:

            i1 i2 i3 i4 i5 i6
  Δ =     [  1  0  1  0  0  1 ]   A1
          [  1  1  0  1  0  0 ]   A2
          [  0  1  1  0  1  0 ]   A3
          [  0  0  1  1  0  1 ]   A3, A4
          [  0  1  0  0  0  1 ]   A5
          [  1  0  0  1  1  0 ]   A6

• Solve LP for x = [x1,…,x6]^T: max 1^T x s.t. Δx ≤ 1
  – Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = e
• Thm: #words_moved = Ω(n^6/M^{e-1}) = Ω(n^6/M^{8/7})
  Attained by block sizes M^{2/7}, M^{3/7}, M^{1/7}, M^{2/7}, M^{3/7}, M^{4/7}
Approach to generalizing lower bounds
• Matmul:
    for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
  ⇒ for (i,j,k) in S = subset of Z^3:
    access locations indexed by (i,j), (i,k), (k,j)
• General case:
    for i1=1:n, for i2 = i1:m, …, for ik = i3:i4
      C(i1+2*i3-i7) = func(A(i2+3*i4, i1, i2, i1+i2, …), B(pnt(3*i4)), …)
      D(something else) = func(something else), …
  ⇒ for (i1, i2, …, ik) in S = subset of Z^k:
    access locations indexed by “projections”, e.g.
      φ_C(i1, i2, …, ik) = (i1+2*i3-i7)
      φ_A(i1, i2, …, ik) = (i2+3*i4, i1, i2, i1+i2, …), …
• Goal: communication lower bounds and optimal algorithms for any program that looks like this
• Can we bound #loop_iterations (#points in S) given bounds on #points in its images φ_C(S), φ_A(S), … ?
General Communication Bound
• Thm: Given a program with array refs given by projections φ_j, there is an e ≥ 1 such that
    #words_moved = Ω(#iterations / M^{e-1})
  where e is the value of a linear program:
    minimize e = Σ_j e_j
    subject to rank(H) ≤ Σ_j e_j · rank(φ_j(H)) for all subgroups H < Z^k
• Proof depends on a recent result in pure mathematics by Christ / Tao / Carbery / Bennett:
  – Given S subset of Z^k and group homomorphisms φ1, φ2, …, bound |S| in terms of |φ1(S)|, |φ2(S)|, …, |φm(S)|
  – Thm (Christ/Tao/Carbery/Bennett): Given s1, …, sm,
      |S| ≤ Π_j |φ_j(S)|^{s_j}
Is this bound attainable? (1/2)
• But first: can we write it down?
  – One inequality per subgroup H < Z^d, but still finitely many!
  – Thm (bad news): Writing down all inequalities in the LP reduces to Hilbert’s 10th problem over Q
    • Could be undecidable: open question
  – Thm (good news): Another LP has the same solution and is decidable (but expensive so far)
  – Thm (better news): Easy to write the LP down explicitly in many cases of interest:
    • When at most 3 arrays
    • When at most 4 loop indices
    • When subscripts are subsets of indices
  – Also easy to get upper/lower bounds on e
    • Tarski-decidable to get superset of constraints (may get s_HBL too large)
Is this bound attainable? (2/2)
• Depends on loop dependencies
• Best case: none, or reductions (matmul)
• Thm: When all subscripts are subsets of indices, the solution x of the dual LP gives optimal tile sizes: M^{x1}, M^{x2}, …
• Ex: linear algebra, n-body, “random code,” join, …
• Conjecture: always attainable (modulo dependencies): work in progress / class project
• Long term goal: incorporate in compilers
Ongoing Work
• Automate generation of lower bounds
• Extend “perfect scaling” results for time and energy by using extra memory
• Have yet to find a case where we cannot attain the lower bound (dependencies permitting) – can we prove this?
• Incorporate into compilers
Outline
• “Direct” Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for “Iterative” Linear Algebra
• Related work
Avoiding Communication in Iterative Linear Algebra
• k steps of iterative solver for sparse Ax=b or Ax=λx
  – Does k SpMVs with A and starting vector
  – Many such “Krylov Subspace Methods”
• Goal: minimize communication
  – Assume matrix “well-partitioned”
  – Serial implementation
    • Conventional: O(k) moves of data from slow to fast memory
    • New: O(1) moves of data – optimal
  – Parallel implementation on p processors
    • Conventional: O(k log p) messages (k SpMV calls, dot prods)
    • New: O(log p) messages – optimal
• Lots of speedup possible (modeled and measured)
  – Price: some redundant computation
Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, …, A^kx]
• Replace k iterations of y = A⋅x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
• Works for any “well-partitioned” A
[Figure: dependency diagram showing which of the entries 1, 2, 3, 4, … 32 of x, A·x, A^2·x, A^3·x each value depends on]
The Matrix Powers Kernel: Sequential Algorithm
• Replace k iterations of y = A⋅x with [Ax, A^2x, …, A^kx]
• Example: A tridiagonal, n = 32, k = 3
[Figure: the same dependency diagram computed in four steps (Step 1 … Step 4); each step reads one block of A and x once and computes a trapezoid of entries of Ax, A^2x, A^3x]
1234… …32
x
A·x
A2·x
A3·x
Communica/onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera/onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen/alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1 Step2 Step3
1234… …32
x
A·x
A2·x
A3·x
Communica/onAvoidingKernels:TheMatrixPowersKernel:[Ax,A2x,…,Akx]
• Replacekitera/onsofy=A⋅xwith[Ax,A2x,…,Akx]• Sequen/alAlgorithm
• Example:Atridiagonal,n=32,k=3
Step1 Step2 Step3 Step4
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
• Replace k iterations of y = A·x with [Ax, A²x, …, Aᵏx]
• Parallel Algorithm: each processor works on an (overlapping) trapezoid and communicates once with its neighbors
• Example: A tridiagonal, n = 32, k = 3
  (Figure: Proc 1 … Proc 4 each own a contiguous range of columns 1 … 32)
Communication-Avoiding Kernels: The Matrix Powers Kernel [Ax, A²x, …, Aᵏx]
Same idea works for general sparse matrices:
  Simple block-row partitioning ⇒ (hyper)graph partitioning
  Left-to-right processing ⇒ Traveling Salesman Problem
Minimizing Communication of GMRES to solve Ax = b
• GMRES: find x in span{b, Ab, …, Aᵏb} minimizing ||Ax − b||₂

Standard GMRES:
  for i = 1 to k
    w = A · v(i−1)             … SpMV
    MGS(w, v(0), …, v(i−1))    … Modified Gram–Schmidt
    update v(i), H
  end for
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, Aᵏv]
  [Q, R] = TSQR(W)             … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

• Oops – W is from the power method, precision lost!
Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k.
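A minimal sketch of the TSQR step used above: QR-factor row blocks of the tall-skinny W independently, then QR the stacked small R factors. With a binary reduction tree this costs O(log p) messages in parallel, versus one reduction per column for MGS. One tree level is shown; the block count p is illustrative.

```python
import numpy as np

def tsqr_r(W, p=4):
    # Local QRs on row blocks: embarrassingly parallel, no communication.
    blocks = np.array_split(W, p, axis=0)
    Rs = [np.linalg.qr(B)[1] for B in blocks]
    # Combine step: QR of the p stacked (small) R factors.
    return np.linalg.qr(np.vstack(Rs))[1]

rng = np.random.default_rng(0)
W = rng.standard_normal((1000, 4))   # tall and skinny
R = tsqr_r(W)
```

Since W = diag(Q_i) · vstack(R_i), the combined R satisfies RᵀR = WᵀW, i.e. it is a valid R factor of W (up to column signs).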
The "monomial" basis [Ax, …, Aᵏx] fails to converge
A different polynomial basis [p1(A)x, …, pk(A)x] does converge
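The instability and its fix can be seen numerically: the monomial basis is the power method, so its columns all tilt toward the dominant eigenvector and the basis becomes nearly rank deficient, while a Newton-type basis [x, (A − t₁I)x, (A − t₂I)(A − t₁I)x, …] with shifts spread over the spectrum stays far better conditioned. The matrix and shifts below are illustrative, not from the slides.

```python
import numpy as np

n, k = 50, 8
evals = np.linspace(1.0, 10.0, n)
A = np.diag(evals)          # diagonal test matrix with known spectrum
x = np.ones(n)

def shifted_basis(shifts):
    # Build [x, (A - t1 I)x, (A - t2 I)(A - t1 I)x, ...]
    V, v = [x], x
    for t in shifts:
        v = A @ v - t * v   # v <- (A - t I) v
        V.append(v)
    return np.column_stack(V)

mono = shifted_basis([0.0] * k)                    # monomial: all shifts zero
newton = shifted_basis(np.linspace(1.0, 10.0, k))  # shifts across the spectrum
```

Comparing `np.linalg.cond(mono)` and `np.linalg.cond(newton)` shows the Newton basis is orders of magnitude better conditioned, which is why it converges where the monomial basis does not.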
Speedups of GMRES on 8-core Intel Clovertown [MHDY09]
Requires Co-tuning Kernels
CA-BiCGStab
With Residual Replacement (RR) à la Van der Vorst and Ye:

  Basis:             Naive    Monomial                         Newton       Chebyshev
  Replacement Its.:  74 (1)   [7,15,24,31,…,92,97,103] (17)    [67,98] (2)  68 (1)
Sample Application Speedups
• Geometric Multigrid (GMG) with CA Bottom Solver
  • Compared BICGSTAB vs. CA-BICGSTAB with s = 4
  • Hopper at NERSC (Cray XE6), weak scaling: up to 4096 MPI processes (24,576 cores total)
• Speedups for miniGMG benchmark (HPGMG benchmark predecessor)
  – 4.2x in bottom solve, 2.5x overall GMG solve
• Implemented as a solver option in BoxLib and CHOMBO AMR frameworks
  – 3D LMC (a low-Mach-number combustion code)
    • 2.5x in bottom solve, 1.5x overall GMG solve
  – 3D Nyx (an N-body and gas dynamics code)
    • 2x in bottom solve, 1.15x overall GMG solve
• Solve Horn–Schunck Optical Flow Equations
  • Compared CG vs. CA-CG with s = 3; 43% faster on NVIDIA GT640 GPU
Tuning space for Krylov Methods

                                   Indices
                                   Explicit (O(nnz))     Implicit (o(nnz))
  Nonzero   Explicit (O(nnz))      CSR and variations    Vision, climate, AMR, …
  entries   Implicit (o(nnz))      Graph Laplacian       Stencils

• Classifications of sparse operators for avoiding communication
  • Explicit indices or nonzero entries cause most communication, along with vectors
  • Ex: With stencils (all implicit), all communication is for vectors
• Operations
  • [x, Ax, A²x, …, Aᵏx] or [x, p1(A)x, p2(A)x, …, pk(A)x]
  • Number of columns in x
  • [x, Ax, A²x, …, Aᵏx] and [y, Aᵀy, (Aᵀ)²y, …, (Aᵀ)ᵏy], or [y, AᵀAy, (AᵀA)²y, …, (AᵀA)ᵏy]
  • Return all vectors or just the last one
• Co-tuning and/or interleaving
  • W = [x, Ax, A²x, …, Aᵏx] and {TSQR(W) or WᵀW or …}
  • Ditto, but throw away W
• Preconditioned versions
Summary of Iterative Linear Algebra
• New lower bounds, optimal algorithms, big speedups in theory and practice
• Lots of other progress, open problems
  – Many different algorithms reorganized
• More underway, more to be done
  – Need to recognize stable variants more easily
  – Preconditioning
    • "Underlapping" instead of "overlapping" Domain Decomposition
    • Hierarchically Semiseparable Matrices
  – Autotuning and synthesis
    • pOSKI for SpMV – available at bebop.cs.berkeley.edu
    • Different kinds of "sparse matrices"
Outline
• "Direct" Linear Algebra
  • Lower bounds on communication
  • New algorithms that attain these lower bounds
• Ditto for programs accessing arrays (e.g. n-body)
• Ditto for "Iterative" Linear Algebra
• Related work
• Write-Avoiding Algorithms
• Reproducibility
Write-Avoiding Algorithms
• What if writes are more expensive than reads?
  – Nonvolatile Memory (Flash, PCM, …)
  – Saving intermediates to disk in the cloud (e.g. Spark)
  – Extra coherency traffic in shared memory
• Can we design "write-avoiding (WA)" algorithms?
  – Goal: find and attain better lower bound for writes
  – Thm: For classical matmul, possible to do asymptotically fewer writes than reads to a given layer of the memory hierarchy
  – Thm: Cache-oblivious algorithms cannot be write-avoiding
  – Thm: Strassen and FFT cannot be write-avoiding
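The matmul claim can be made concrete with a toy traffic counter: with block size b and a loop order that keeps the current C block resident in fast memory across the whole inner (k) loop, slow-memory writes total only n² words while reads grow like O(n³/b). This is a counting sketch under those assumptions, not an implementation from the talk.

```python
def blocked_matmul_traffic(n, b):
    # Count slow-memory words moved for blocked classical C = A·B,
    # assuming the C(i,j) block stays in fast memory across the k loop.
    reads = writes = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            reads += b * b            # load the C(i,j) block once
            for kk in range(0, n, b):
                reads += 2 * b * b    # load A(i,kk) and B(kk,j) blocks
            writes += b * b           # store the C(i,j) block once
    return reads, writes

reads, writes = blocked_matmul_traffic(n=1024, b=64)
```

Here writes = n² exactly, while reads = n² + 2n³/b, so the read/write ratio grows with n/b — asymptotically fewer writes than reads, as the theorem states for classical matmul.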
Measured L3–DRAM traffic on Intel Nehalem Xeon-7560
Optimal #DRAM reads = O(n³/M^(1/2)); optimal #DRAM writes = n²

Cache-Oblivious Matmul (L3: CO, L2: CO, L1: CO) – #DRAM reads close to optimal, #DRAM writes much larger:

  Blocking size          128    256    512    1K    2K    4K    8K    16K    32K
  L3_VICTIMS.M           1.9    2.1    1.8    2.3   4.8   9.8   19.5  39.6   78.5
  L3_VICTIMS.E           0.4    0.8    1.6    4.2   8.8   17.9  36.6  75.4   147.5
  LLC_S_FILLS.E          2.4    2.8    3.8    6.9   14    28.1  56.5  115.5  226.6
  Misses on Ideal Cache  2.512  3.024  4.048  6     12    24    48    96     192
  Write L.B.             2      2      2      2     2     2     2     2      2

Write-Avoiding Matmul (L3: 700, L2: MKL, L1: MKL) – total L3 misses close to optimal, total DRAM writes much larger:

  Blocking size          128    256    512    1K    2K    4K    8K    16K    32K
  L3_VICTIMS.M           2      2      1.9    2     2.3   2.7   3     3.6    4.4
  L3_VICTIMS.E           0.4    0.9    1.9    4.2   11.8  25.3  50.7  101.7  203.2
  LLC_S_FILLS.E          2.5    3      4.1    6.5   14.5  28.3  54    105.7  208.2
  Write L.B.             2      2      2      2     2     2     2     2      2
Reproducible Floating Point Computation
• Do you get the same answer if you run the same program twice with the same input?
  – Not even on your multicore laptop!
• Floating point addition is nonassociative; summation order is not reproducible
• First release of the ReproBLAS
  – Reproducible BLAS 1, independent of data order, number of processors, data layout, reduction tree, …
  – Sequential and distributed memory (MPI)
  • bebop.cs.berkeley.edu/reproblas
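The nonassociativity is easy to demonstrate: summing the same three numbers in a different order yields a different double. A correctly rounded sum (Python's math.fsum here, standing in for the order-independent summation ReproBLAS provides with a scalable algorithm) gives the same answer regardless of order.

```python
import math

# Same three numbers, two association orders, two different doubles:
a = (0.1 + 0.2) + 0.3   # left to right
b = 0.1 + (0.2 + 0.3)   # right to left

# A correctly rounded sum is independent of the order of the data:
s1 = math.fsum([0.1, 0.2, 0.3])
s2 = math.fsum([0.3, 0.2, 0.1])
```

On a parallel machine the reduction tree (and hence the association order) can change from run to run, which is exactly why naive parallel sums are not reproducible.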
More possible class projects
• Could be one or more of
  – Extend lower bounds, new algorithms, compare existing algorithms, use algorithms to improve existing applications, performance analysis/measurement, compiler infrastructure, present existing results, …
• Extend to machine learning algorithms, others of the "13 motifs"
For more details
• bebop.cs.berkeley.edu
  – 155-page survey in Acta Numerica
• CS267 – Berkeley's Parallel Computing Course
  – Live broadcast in Spring 2016
    • www.cs.berkeley.edu/~demmel
    • All slides, video available
  – Prerecorded version broadcast since Spring 2013
    • www.xsede.org
    • Free supercomputer accounts to do homework
    • Free autograding of homework
Collaborators and Supporters
• James Demmel, Kathy Yelick, Michael Anderson, Grey Ballard, Erin Carson, Aditya Devarakonda, Michael Driscoll, David Eliahu, Andrew Gearhart, Evangelos Georganas, Nicholas Knight, Penporn Koanantakool, Ben Lipshitz, Oded Schwartz, Edgar Solomonik, Omer Spillinger
• Austin Benson, Maryam Dehnavi, Mark Hoemmen, Shoaib Kamil, Marghoob Mohiyuddin
• Abhinav Bhatele, Aydin Buluc, Michael Christ, Ioana Dumitriu, Armando Fox, David Gleich, Ming Gu, Jeff Hammond, Mike Heroux, Olga Holtz, Kurt Keutzer, Julien Langou, Devin Matthews, Tom Scanlon, Michelle Strout, Sam Williams, Hua Xiang
• Jack Dongarra, Jakub Kurzak, Dulceneia Becker, Ichitaro Yamazaki, …
• Sivan Toledo, Alex Druinsky, Inon Peled
• Laura Grigori, Sebastien Cayrols, Simplice Donfack, Mathias Jacquelin, Amal Khabou, Sophie Moufawad, Mikolaj Szydlarski
• Members of ParLab, ASPIRE, BEBOP, CACHE, EASI, FASTMath, MAGMA, PLASMA
• Thanks to DOE, NSF, UC Discovery, INRIA, Intel, Microsoft, Mathworks, National Instruments, NEC, Nokia, NVIDIA, Samsung, Oracle
• bebop.cs.berkeley.edu
Summary
Don't Communic…
Time to redesign all linear algebra, n-body, … algorithms and software
(and compilers…)