-
Adaptive Multi-level Blocking Optimization for Sparse Matrix Vector Multiplication on GPU
Yusuke Nagasaka, Akira Nukada, Satoshi Matsuoka
Tokyo Institute of Technology
-
Sparse Matrix Computation
• Large and sparse systems of equations
  – Generated by FEM
  – Sparse matrix vector multiplication (SpMV) takes much of the execution time when solving the equations
• Using a sparse format, which removes needless zero elements
  – Indirect memory access to input vector elements
    » Random memory access: frequent cache misses
  – Requires row and column indices for each non-zero element
    » Aggravates the memory-boundness of Mat-vec
=> Performance degradation of SpMV
[Figure: a sparse matrix and its CSR Format representation with the Row pointer, Column id and Value arrays]
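A minimal sketch of SpMV over the three CSR arrays may make the indirect access concrete. The function name `spmv_csr` and the 3x3 matrix are illustrative assumptions; the matrix is not the one from the slide figure.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Non-zeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through col_idx: this is the indirect (random) access.
            y[row] += values[k] * x[col_idx[k]]
    return y

# Illustrative matrix (assumption, not the slide's example):
# [[10,  0, 20],
#  [ 0, 30,  0],
#  [40,  0, 50]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [10.0, 20.0, 30.0, 40.0, 50.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
print(y)
```

Every non-zero costs one value load, one column index load and one gather from `x`, which is why both the index size and the locality of `x` matter for a memory-bound kernel.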
-
SpMV on Many-core Processors
• Acceleration by high parallelism and memory bandwidth
  – Small cache ⇔ Enormous number of executing threads
• More frequent cache misses
  – Performance is strongly limited by memory bandwidth
    • Not exploiting many-core processors
-
Sparse Format
• Compressing needless zero elements
  – Storing only non-zero elements
  – Reducing memory usage and computation
• Each format has its own characteristics
  – Suited to particular architectures and given matrices
Format  Memory Layout  Load-balance  Memory Access to Matrix Data  Locality of access to input vector
CSR  Row-major  -  -
ELLPACK  Column-major  -  -  -
SELL-C-σ  Column-major  -  -
Segmented, NUS  Column-major  -
BCCOO (yaSpMV)  Row-major  -
-
Sparse Format
• Reducing total memory access is important to alleviate the memory-boundness
  – However, existing work does not reduce it enough
  – We should consider the access to both matrix data and input vector elements
    • Reduction of total memory access
Format  Memory Layout  Load-balance  Memory Access to Matrix Data  Locality of access to input vector
CSR  Row-major  -  -
ELLPACK  Column-major  -  -  -
SELL-C-σ  Column-major  -  -
Segmented, NUS  Column-major  -
BCCOO (yaSpMV)  Row-major  -
-
Contribution
• We propose a new sparse matrix format for GPU
  – Adaptive Multi-level Blocking (AMB) format
    • Several optimization techniques such as division and blocking
    • Aggressively reducing total memory access in SpMV, including both matrix data and input vector, to improve performance
    • Precisely predicting total memory access and performance of SpMV
  – For 32 matrices taken from the University of Florida Sparse Matrix Collection, AMB achieves speedups of
    • Up to 2.92x compared to cuSPARSE (1.74x on average)
    • Up to 1.40x compared to yaSpMV (1.13x on average)
-
Adaptive Multi-level Blocking (AMB)
• Largely reducing total memory access in SpMV
• Constructed in multi-level division and blocking
  – Improving the locality of the access to the input vector by dividing the matrix and the input vector
  – Mainly focusing on compression of the column index
Format  Memory Layout  Load-balance  Memory Access to Matrix Data  Locality of access to input vector
CSR  Row-major  -  -
ELLPACK  Column-major  -  -  -
SELL-C-σ  Column-major  -  -
Segmented, NUS  Column-major  -
BCCOO (yaSpMV)  Row-major  -
AMB (Proposal)  Column-major
-
AMB (Adaptive Multi-level Blocking)
• Level 1: Column-wise segmentation
  – Improving the locality of the access to input vector elements in SpMV
  – Segmentation width is less than 65536 (= 2^16) columns
    • For the compression of the column index in the second-level division
[Figure: matrix divided column-wise into segments; label: Segmentation width]
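The column-wise segmentation can be sketched as a grouping of non-zeros by column range. This is an illustrative sketch only: `segment_columns`, the coordinate-list input and the tiny width of 4 are assumptions chosen for readability, not the authors' implementation.

```python
def segment_columns(entries, seg_width):
    """Group (row, col, value) entries by column segment.

    Returns {segment_id: [(row, local_col, value), ...]} where
    local_col = col % seg_width is the column index inside the segment.
    """
    assert 0 < seg_width <= 2**16  # so that local_col fits in 16 bits
    segments = {}
    for row, col, value in entries:
        seg_id, local_col = divmod(col, seg_width)
        segments.setdefault(seg_id, []).append((row, local_col, value))
    return segments

# Illustrative 3x8 matrix given as a coordinate list (assumption)
entries = [(0, 1, 1.0), (0, 5, 2.0), (1, 3, 3.0), (2, 0, 4.0), (2, 7, 5.0)]
segments = segment_columns(entries, seg_width=4)
print(segments)
```

Each segment only touches `seg_width` consecutive elements of the input vector, so those elements are more likely to stay in cache while the segment is processed.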
-
AMB (Adaptive Multi-level Blocking)
• Level 2: Row-wise segmentation and compression
  – Converting each sub-matrix to SELL-C-σ
    • Sorting rows by the number of non-zero elements per row
      – Sorting scope (σ) is limited: less than 32768 (= 2^15)
        » σ = 6 in this figure
[Figure: rows reordered by #non-zero/row within the Sorting scope (σ); the Permutation array keeps the original row index]
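The windowed sort can be sketched as follows. The helper name `sort_rows_in_scope`, the descending order and the non-zero counts are assumptions for illustration; the slide only states that rows are sorted by non-zero count inside a limited scope σ.

```python
def sort_rows_in_scope(nnz_per_row, sigma):
    """Return a permutation that sorts rows by non-zero count (descending),
    independently inside each window of sigma consecutive rows."""
    perm = []
    for start in range(0, len(nnz_per_row), sigma):
        window = range(start, min(start + sigma, len(nnz_per_row)))
        # sorted() is stable, so rows with equal counts keep their order.
        perm.extend(sorted(window, key=lambda r: -nnz_per_row[r]))
    return perm

nnz_per_row = [2, 1, 3, 0, 2, 1, 0, 3]   # illustrative counts (assumption)
perm = sort_rows_in_scope(nnz_per_row, sigma=4)
print(perm)
```

Because a row never leaves its window, `perm[i]` differs from `i` by less than σ, which is what later allows the permutation to be stored in 16 bits.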
-
AMB (Adaptive Multi-level Blocking)
• Level 2: Row-wise segmentation and compression
  – Converting each sub-matrix to SELL-C-σ
    • Row-wise division by C rows (C = 3 in this figure)
      – C is set based on the architecture (C = 32 in the GPU case)
[Figure: sorted rows divided into chunks of C rows; labels: Permutation, C]
-
AMB (Adaptive Multi-level Blocking)
• Level 2: Row-wise segmentation and compression
  – Converting each sub-matrix to SELL-C-σ
    • Converting each chunk to ELLPACK
    • More memory access to CS, CL and Permutation is required compared to the original SELL-C-σ (w/o Level 1 segmentation)
[Figure: arrays of the converted sub-matrices: Value data, Column Index, CS (Chunk Starting offset), CL (Chunk Length), Permutation]
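The chunk conversion can be illustrated with a small packing routine. This is a sketch under assumptions: the name `build_sell_chunks`, the list-of-pairs row representation, and padding with column 0 and value 0.0 are choices made here, not details taken from the slides.

```python
def build_sell_chunks(rows, perm, C):
    """Pack sorted rows into chunks of C rows, each stored as ELLPACK.

    rows[r] is a list of (col, value) pairs for original row r.
    Returns (values, cols, cs, cl); each chunk is stored column-major
    and padded with zeros up to the length of its longest row.
    """
    values, cols, cs, cl = [], [], [], []
    for start in range(0, len(perm), C):
        chunk = [rows[r] for r in perm[start:start + C]]
        chunk += [[]] * (C - len(chunk))          # pad a short last chunk
        length = max(len(r) for r in chunk)
        cs.append(len(values))                    # CS: chunk starting offset
        cl.append(length)                         # CL: chunk length
        for j in range(length):                   # column-major layout
            for r in chunk:
                col, val = r[j] if j < len(r) else (0, 0.0)
                cols.append(col)
                values.append(val)
    return values, cols, cs, cl

rows = [[(0, 1.0)], [(1, 2.0), (2, 3.0)], [], [(0, 4.0), (3, 5.0)]]
perm = [1, 0, 3, 2]          # rows sorted by length with sigma = 2
values, cols, cs, cl = build_sell_chunks(rows, perm, C=2)
print(values, cols, cs, cl)
```

In the column-major layout, the C threads of a chunk read consecutive addresses on every step, which is the access pattern a GPU can coalesce.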
-
AMB (Adaptive Multi-level Blocking)
• Level 2: Row-wise segmentation and compression
  – Elimination of empty chunks
    • Removing needless elements of CS, CL and Permutation
[Figure: Value data, Column Index, CS (Chunk Starting offset), CL (Chunk Length) and Permutation arrays after eliminating empty chunks]
-
AMB (Adaptive Multi-level Blocking)
• Level 2: Row-wise segmentation and compression
  – Column indices are represented by 16-bit integers (unsigned short)
    • The original index is divided by the segmentation width and the remainder is stored as the column index
      – The quotient is stored in the upper 16 bits of the 'CL' (Chunk Length) array
[Figure: Value data, Column Index, CS (Chunk Starting offset), CL (Chunk Length) and Permutation arrays with 16-bit column indices]
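The quotient and remainder scheme can be sketched with plain bit operations. The helper names, the assumption that the chunk length occupies the lower 16 bits of a 32-bit CL entry, and the example width of 50000 are illustrative; the slide only states that the quotient goes into the upper 16 bits of CL.

```python
def pack_cl(chunk_length, segment_id):
    """Store the segment id (quotient) in the upper 16 bits of a CL entry."""
    assert 0 <= chunk_length < 2**16 and 0 <= segment_id < 2**16
    return (segment_id << 16) | chunk_length

def unpack_cl(cl_entry):
    """Return (chunk_length, segment_id) from a packed 32-bit CL entry."""
    return cl_entry & 0xFFFF, cl_entry >> 16

def restore_column(local_col, cl_entry, seg_width):
    """Rebuild the original column index from the 16-bit remainder."""
    _, segment_id = unpack_cl(cl_entry)
    return segment_id * seg_width + local_col

seg_width = 50000                      # illustrative width below 2^16
original_col = 123456
segment_id, local_col = divmod(original_col, seg_width)
cl_entry = pack_cl(chunk_length=7, segment_id=segment_id)
restored = restore_column(local_col, cl_entry, seg_width)
print(segment_id, local_col, restored)
```

The quotient is stored once per chunk rather than once per non-zero, so each column index shrinks from 4 bytes to 2 at almost no extra cost.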
-
AMB (Adaptive Multi-level Blocking)
• Level 2: Row-wise segmentation and compression
  – Compression of the original row index (permutation) by using 16-bit integers (σ ≤ 32768 (= 2^15))
    • The original row index is divided by the sorting scope (σ = 8 in this figure) and the remainder is stored as the permutation
      – The quotient is stored in the Permutation offset array (16-bit integer)
[Figure: Value data, Column Index, CS, CL, Permutation and Permutation offset arrays]
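The same quotient and remainder idea applied to row indices can be sketched as below. The function names are hypothetical, and the sketch assumes σ is a multiple of C so that all rows of a chunk share one quotient, which the slides do not state explicitly.

```python
def compress_permutation(perm, sigma, C):
    """Split original row indices into remainders and per-chunk offsets.

    Assumes sigma is a multiple of C, so every row of a chunk lies in the
    same sorting scope and shares one quotient.
    """
    assert sigma % C == 0 and sigma <= 2**15
    perm16 = [r % sigma for r in perm]                       # fits in 16 bits
    offsets = [perm[i] // sigma for i in range(0, len(perm), C)]
    return perm16, offsets

def original_row(perm16, offsets, sigma, C, i):
    """Recover the original row index of sorted position i."""
    return offsets[i // C] * sigma + perm16[i]

perm = [2, 0, 1, 3, 7, 4, 5, 6]       # e.g. rows sorted with sigma = 4
perm16, offsets = compress_permutation(perm, sigma=4, C=2)
recovered = [original_row(perm16, offsets, 4, 2, i) for i in range(len(perm))]
print(perm16, offsets, recovered)
```

A per-chunk offset is needed because, once empty chunks are eliminated, the position of a chunk in the arrays no longer determines which sorting scope it came from.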
-
AMB (Adaptive Multi-level Blocking)
• Level 3: Blocking
  – Regarding multiple non-zero elements as one element
    • Compression of contiguous column indices (block size = 3 in the figure)
    • Large block size: the compression rate is high, but the amount of zero-filling in the value array may increase
[Figure: Value data and Column index arrays before and after blocking. With block size 3, the stored column indices 0, 0, 1, 2, 3, 4 imply the columns (0 1 2), (0 1 2), (1 2 3), (2 3 4), (3 4 5), (4 5 6)]
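The trade-off between index compression and zero-filling can be seen in a small blocking routine. The greedy left-to-right blocking in `block_row` is an assumption made for this sketch; the slides do not specify how blocks are formed.

```python
def block_row(cols, vals, b):
    """Compress one row's sorted column indices into blocks of width b.

    Each block stores only its first column index; its b values cover the
    columns start .. start + b - 1, with zeros for missing entries.
    """
    block_cols, block_vals = [], []
    i = 0
    while i < len(cols):
        start = cols[i]
        block = [0.0] * b
        while i < len(cols) and cols[i] < start + b:
            block[cols[i] - start] = vals[i]
            i += 1
        block_cols.append(start)
        block_vals.extend(block)
    return block_cols, block_vals

cols = [0, 1, 2, 5, 7]                 # illustrative row (assumption)
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
block_cols, block_vals = block_row(cols, vals, b=3)
zero_fill = len(block_vals) - len(vals)
print(block_cols, block_vals, zero_fill)
```

Here five column indices shrink to two, at the price of one zero-filled value; with less contiguous rows the balance tips the other way.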
-
AMB (Adaptive Multi-level Blocking)
• Level 3: Blocking
  – Compression of contiguous column indices
  – Block size: 1 to 10 (block size = 3 in the figure)
    • The block size is chosen considering the trade-off between compression of the column index and zero-filling in the value data
[Figure: Value data, Column Index, CS, CL, Permutation and Permutation offset arrays before and after blocking]
-
SpMV Kernel for AMB format in CUDA
• Constructed in two CUDA kernels
  – Initializing the output vector with zero
  – Matrix vector multiplication kernel
    • One thread is assigned to each row and the result of each row is accumulated in the output vector by an atomic operation
[Figure: the 1st Initialization Kernel zeroes the Output vector; the 2nd Matrix-Vector Multiply Kernel accumulates into it using Permutation and Permutation offset]
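A sequential emulation may clarify why the accumulation has to be atomic. The function `spmv_segmented` and its dictionary-based input are hypothetical simplifications: they keep the column segments but leave out the chunking, blocking and 16-bit packing of the real format.

```python
def spmv_segmented(n_rows, seg_width, segments, x):
    """Sequential emulation of the two-kernel scheme.

    segments maps segment_id -> list of (original_row, [(local_col, value), ...]).
    """
    y = [0.0] * n_rows                       # 1st kernel: zero the output
    for seg_id, seg_rows in segments.items():
        base = seg_id * seg_width
        for orig_row, row in seg_rows:       # one GPU thread per row
            partial = sum(v * x[base + c] for c, v in row)
            y[orig_row] += partial           # atomicAdd on the GPU
    return y

# Illustrative 3x8 matrix split into two segments of width 4 (assumption)
segments = {
    0: [(0, [(1, 1.0)]), (1, [(3, 3.0)]), (2, [(0, 4.0)])],
    1: [(0, [(1, 2.0)]), (2, [(3, 5.0)])],
}
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = spmv_segmented(3, 4, segments, x)
print(y)
```

Rows 0 and 2 appear in both segments, so two different threads add into the same output element; on the GPU these updates may happen concurrently, hence the atomic operation and the separate zero-initialization pass.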
-
Estimation of Total Memory Access in SpMV
• Performance prediction by estimating total memory access
  – AMB format has high locality at the cache level
    • We can estimate total memory access including random access
    • We can predict the performance without running the computation => Utilizing for parameter tuning
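For an M × N matrix stored in the AMB format:
• M, N: row size, column size
• v: bytes per floating-point value (v = 4 in single precision and v = 8 in double precision)
• nzz_f: number of elements in 'value', including zero-filling
• b: block size in the third level
• C: chunk size in the second level
• n_C: number of chunks

Access to matrix data ('value', 'column index', 'CS', 'CL', 'Permutation offset', 'Permutation'):
$$ nzz_f \left( v + \frac{2}{b} \right) + n_C \cdot 4 \cdot 2 + n_C \cdot 2 + n_C \cdot C \cdot 2 \tag{4.1} $$
Access to the vectors (input vector, output vector, and the read and write of atomicAdd):
$$ N \cdot v + M \cdot v + n_C \cdot C \cdot v \cdot 2 \tag{4.2} $$
Total memory access of AMB in CUDA:
$$ nzz_f \left( v + \frac{2}{b} \right) + n_C \left( 2 \cdot v \cdot C + 2 \cdot C + 10 \right) + (M + N) \cdot v \tag{4.3} $$
For AMD GPU (OpenCL, atomicAdd by compare and swap), the 'check' term M + n_C · C is added:
$$ nzz_f \left( v + \frac{2}{b} \right) + n_C \left( 2 \cdot v \cdot C + 3 \cdot C + 10 \right) + (M + N) \cdot v + M \tag{4.4} $$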
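As a sketch of how the estimate could drive parameter tuning, the function below evaluates the CUDA total (4.3) as the sum of the matrix term (4.1) and the vector term (4.2). The function name and all the numbers fed to it, including the zero-filled element counts per block size, are hypothetical values chosen for illustration.

```python
def amb_memory_bytes(nzz_f, b, C, n_C, M, N, v=4):
    """Estimated total memory access in bytes for the CUDA case, Eq. (4.3)."""
    # Eq. (4.1): value, column index, CS and CL, Permutation offset, Permutation
    matrix = nzz_f * v + nzz_f * 2 / b + n_C * 4 * 2 + n_C * 2 + n_C * C * 2
    # Eq. (4.2): input vector, output vector, atomicAdd read and write
    vectors = N * v + M * v + n_C * C * v * 2
    return matrix + vectors

shape = dict(C=32, n_C=4000, M=100_000, N=100_000)   # hypothetical matrix

# Hypothetical element counts (with zero-filling) for each candidate block size
nzz_f_by_block = {1: 3_000_000, 2: 3_100_000, 3: 3_300_000}

total = amb_memory_bytes(nzz_f_by_block[3], 3, **shape)
best_b = min(nzz_f_by_block,
             key=lambda b: amb_memory_bytes(nzz_f_by_block[b], b, **shape))
print(total, best_b)
```

Only `nzz_f` has to be counted for each candidate block size; no SpMV run is needed to rank the candidates, which is the point of the estimate.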
-
Performance Evaluation
-
Experiment Environment
• TSUBAME-KFC
  – GPU: NVIDIA Tesla K20X
    • Memory size: 6 [GB]
    • Memory bandwidth: 250 [GB/s]
    • ECC off
    • L2 cache size: 1.5 [MB]
    • Read-only cache size: 12 [KB] * 4 / SMX
  – CUDA: Version 7.0
-
Experiment Environment
• Matrix data
  – The University of Florida Sparse Matrix Collection
    • Selecting 32 positive definite symmetric matrices whose row sizes are larger than 131072 (= 2^17)
• Evaluated sparse formats
  – cuSPARSE: CSR, HYBRID, BSR (Block CSR)
  – SELL-C-σ
  – {Segmented, Non-Uniformly-Segmented}-SELL-C-σ
  – yaSpMV: BCCOO (only single precision)
-
Performance Evaluation: SpMV in Single-precision
• Speedups of
  – 2.92x at maximum and 1.74x on average compared to cuSPARSE
  – 1.40x at maximum and 1.13x on average compared to yaSpMV
• Performance of AMB becomes worse when the average number of non-zero elements per row (nz/row) is small
  – Benefit of compressing column indices < Cost of atomicAdd
[Figure: GFLOPS (0 to 90) per matrix, x axis labeled SMALL to LARGE; series: cuSPARSE, SELL-C-σ, (S, NUS)-SELL-C-σ, yaSpMV, AMB; annotation: nz/row = 24]
-
Performance Evaluation: SpMV in Double-precision
• Speedups of
  – 1.86x at maximum and 1.29x on average compared to cuSPARSE
  – 2.59x at maximum and 1.83x on average compared to CSR
• Memory access to value data is larger and the compression rate becomes worse compared to single precision
[Figure: GFLOPS (0 to 50) for G3_circuit, offshore, Dubcova3, af_shell3, bmw7st_1, shipsec5 and audikw_1, x axis labeled SMALL to LARGE; series: cuSPARSE, SELL-C-σ, (S, NUS)-SELL-C-σ, AMB]
-
Performance Evaluation: SpMV in Single-precision
• Performance of each level of AMB format
  – Memory access to matrix data increases in Level 1, while the locality of the access to the input vector is improved
  – Reducing memory access to matrix data in Level 2 and Level 3 contributes to the speedups
[Figure: GFLOPS (0 to 90) per matrix, x axis labeled SMALL to LARGE; series: SELL-C-σ, AMB (Level 1, 2 w/o compression), AMB (Level 1, 2), AMB (Full Level)]
-
Total Memory Access in SpMV
• Evaluating total memory access in SpMV by using nvprof
  – The format with minimum total memory access shows the best SpMV performance for most matrices
[Figure: Memory Traffic Ratio compared to CSR (0 to 1.2) for G3_circuit, offshore, Dubcova3, af_shell3, bmw7st_1, shipsec5 and audikw_1, x axis labeled SMALL to LARGE; series: SELL-C-σ, yaSpMV, AMB (Level 1, 2), AMB; annotation: better]
-
Estimation of Total Memory Access and Performance Prediction
• Very accurate estimation of total memory access in SpMV with AMB format
[Figure: Estimated Total Memory Access [MB] vs. Measured Total Memory Access [MB], both axes 0 to 600; Correlation coefficient = 0.999]
-
Estimation of Total Memory Access and Performance Prediction
• Performance of SpMV depends on memory access
  – Performance degradation by load-imbalance for some data
[Figure: Execution Time of SpMV with AMB format [msec] (0 to 3) vs. Estimated Total Memory Access [MB] (0 to 600)]
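If execution time is governed by memory traffic, a first-order prediction is simply traffic divided by bandwidth. The sketch below uses the 250 GB/s figure listed for the K20X as its default; treating that figure as the sustained bandwidth is an assumption, so the result is an optimistic lower bound rather than the authors' prediction model.

```python
def predicted_time_ms(total_bytes, bandwidth_gb_s=250.0):
    """Lower bound on SpMV time if the kernel ran at full memory bandwidth."""
    return total_bytes / (bandwidth_gb_s * 1e9) * 1e3

def predicted_gflops(nnz, time_ms):
    """SpMV performance, counting one multiply and one add per non-zero."""
    return 2 * nnz / (time_ms * 1e-3) / 1e9

t = predicted_time_ms(500e6)               # 500 MB of estimated traffic
gflops = predicted_gflops(50_000_000, t)   # hypothetical non-zero count
print(t, gflops)
```

Load imbalance and atomic contention are not modeled here, which is consistent with the slide's remark that some matrices fall short of what their memory traffic alone would suggest.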
-
Related Work (1)
• CoAdELL [Maggioni, IPDPS 2014]
  – Delta between column indices is compressed to 16-bit or 8-bit
  – No compression when the delta is large
• BRO format [Tang, SC13]
  – Delta between column indices is represented in a more precise number of bits
  – Optimized for GPU by removing sequential execution
  – Although BRO successfully reduces the memory footprint, it requires additional decompression schemes
[Figure: example of Column index values and their deltas]
[Figure: GFLOPS (0 to 35) of BRO-ELL and AMB for the matrices shipsec1, consph and pdb1HYS]
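Delta encoding of column indices, the idea shared by CoAdELL and BRO, can be sketched generically as below. This is not either paper's actual bit layout; the function names and the example row are assumptions.

```python
def delta_encode(cols):
    """Keep the first column index and store differences for the rest."""
    if not cols:
        return []
    return [cols[0]] + [b - a for a, b in zip(cols, cols[1:])]

def delta_decode(deltas):
    """Rebuild column indices with a running sum (inherently sequential)."""
    out, cur = [], 0
    for d in deltas:
        cur += d
        out.append(cur)
    return out

cols = [3, 4, 7, 8, 300]               # illustrative sorted column indices
deltas = delta_encode(cols)
fits_8bit = all(d < 2**8 for d in deltas)
fits_16bit = all(d < 2**16 for d in deltas)
print(deltas, fits_8bit, fits_16bit)
```

The running sum in `delta_decode` is the decompression cost mentioned above; AMB's remainder-based indices avoid it because each stored index can be decoded independently.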
-
Related Work (2)
• BCCOO (yaSpMV) [Yan, PPoPP '14]
  – Storing the matrix data in extended COO with blocking of the matrix to reduce the memory usage of the column and row indices
    • Local memory such as shared memory on GPU is effectively utilized when the computation result of each block is accumulated in SpMV
=> Highly accelerating SpMV over existing libraries such as cuSPARSE
  – Parameter auto-tuning mechanism
    • Finds the best set of parameters by conversions and executions
    • Large parameter space
      – Cost of parameter tuning is extremely high
-
Conclusion
• To improve the performance of SpMV on GPU, we propose the AMB (Adaptive Multi-level Blocking) format, which reduces total memory access
  – Multi-level division and blocking improve the locality of the access to the input vector and compress the column index
  – High performance improvement over existing SpMV libraries such as cuSPARSE and yaSpMV
• Future work
  – Investigating the effectiveness of our AMB format on other many-core processors such as AMD GPUs and Xeon Phi
-
Backup
-
Zero-filling in Value data
• The amount of zero-filling in the value data is small
  – The table shows the results for single precision
Matrix  Best block size  #zero-fill / #non-zero [%]  Total memory access of (best block size) / (block size = 1) [%]
af_shell3  5  0.096  75.62
audikw_1  3  0.451  79.77
bmw7st_1  6  4.709  82.22
Dubcova3  2  21.691  99.97
G3_circuit  1  0.070  100.00
offshore  1  0.212  100.00
shipsec5  6  1.731  74.92
-
Loop unrolling
• Loop unrolling technique is used for SpMV
  – Divergence effect is large on GPU

Block size = 1
    // 4 threads per chunk, chunk length 3, column-major layout
    sum = 0;
    adr = thread_id;
    for (i = 0; i < 3; i++) {
        sum += value[adr] * vec[col[adr]];
        adr += 4;                              // next element of this row
    }
    output += sum;

Block size = 3
    sum = 0;
    adr = thread_id;
    for (i = 0; i < 3 / 3; i++) {
        c = col[adr / (4 * 3) * 4 + adr % 4];  // one column index per block
        sum += value[adr]         * vec[c];
        sum += value[adr + 4 * 1] * vec[c + 1];
        sum += value[adr + 4 * 2] * vec[c + 2];
        adr += 4 * 3;                          // next block of this row
    }
    output += sum;

[Figure: Value data and Column index layout for Thread 0 to Thread 3, for block size 1 and block size 3]
-
Total Memory Access and Execution time
• Execution time depends on total memory access in SELL-C-σ, yaSpMV and AMB
  – AMB also achieves high throughput although its total memory access is less than that of the other formats
[Figure: Execution Time of SpMV [msec] (0 to 10) vs. Measured Memory Traffic [MB] (0 to 1000); series: CSR, SELL-C-σ, yaSpMV, AMB]