TRANSCRIPT
1
Linear Algebraic Primitives for Parallel Computing on Large Graphs
Aydın Buluç, Ph.D. Defense, March 16, 2010
Committee: John R. Gilbert (Chair), Ömer Eğecioğlu, Fred Chong, Shivkumar Chandrasekaran
Sources of Massive Graphs
(WWW snapshot, courtesy Y. Hyun) (Yeast protein interaction network, courtesy H. Jeong)
Graphs naturally arise from the internet and social interactions
Many scientific (biological, chemical, cosmological, ecological, etc.) datasets are modeled as graphs.
3
Types of Graph Computations
Tightly coupled. Tool: graph traversal. Examples: centrality, shortest paths, network flows, strongly connected components.
Filtering based. Tool: Map/Reduce. Examples: loop and multi-edge removal, triangle/rectangle enumeration.
Fuzzy intersection. Examples: clustering, algebraic multigrid.
[Figures: a 7-vertex example graph; a map/reduce dataflow diagram]
Tightly Coupled Computations
A. Computationally intensive graph/data mining algorithms.
(e.g. graph clustering, centrality, dimensionality reduction)
B. Inherently latency-bound computations.
(e.g. finding s-t connectivity)
4
Huge graphs + expensive kernels ⇒ high performance and massive parallelism needed
Tightly Coupled Computations
Let the special-purpose architectures handle (B); what can we do for (A) with commodity architectures?
We need memory! Disk is not an option!
5
Software for Graph Computation
"[The most important lesson I learned from] spending ten years of my life on the TeX project is that software is hard. It's harder than anything else I've ever had to do." (Donald Knuth)
Dealing with software is hard!
High performance computing (HPC) software is harder!
How do we deal with parallel HPC software?
9
Parallel Graph Software
Where shared memory scales, it pays off to target productivity.
Distributed memory should target p > 100 at least.
Shared and distributed memory complement each other.
Coarse-grained parallelism for distributed memory.
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
10
11
All-Pairs Shortest Paths
Goal: find least-cost paths between all reachable vertex pairs
Classical algorithm: Floyd-Warshall
Case study of implementation on a multicore architecture: the graphics processing unit (GPU)
for k = 1:n   % the induction sequence
  for i = 1:n
    for j = 1:n
      if w(i,k) + w(k,j) < w(i,j)
        w(i,j) := w(i,k) + w(k,j)
[Figure: 5-vertex example graph and its distance matrix, k = 1 case]
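As a runnable illustration of the triple loop above, here is a plain-Python version; the 4-vertex example graph in the usage below is my own, not from the slides:

```python
from math import inf

def floyd_warshall(w):
    """In-place Floyd-Warshall: w[i][j] becomes the least path cost i -> j.
    w is a dense n-by-n list of lists, with inf marking absent edges."""
    n = len(w)
    for k in range(n):          # the induction sequence
        for i in range(n):
            for j in range(n):
                if w[i][k] + w[k][j] < w[i][j]:
                    w[i][j] = w[i][k] + w[k][j]
    return w
```

Usage: `floyd_warshall([[0, 1, inf, 10], [inf, 0, 1, inf], [inf, inf, 0, 1], [inf, inf, inf, 0]])` leaves cost 3 in entry (0, 3), the path 0 -> 1 -> 2 -> 3. The work is Θ(n³) regardless of sparsity, which is what makes the blocked reformulation on the next slides worthwhile.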
12
GPU characteristics
Powerful: two Nvidia 8800s > 1 TFLOPS
Inexpensive: $500 each
But:
- Difficult programming model: one instruction stream drives 8 arithmetic units
- Performance is counterintuitive and fragile: memory access pattern has subtle effects on cost
- Extremely easy to underutilize the device: doing it wrong easily costs 100x in time
[Figure: SIMD diagram, one instruction stream driving threads t1-t16 in lockstep]
13
Recursive All-Pairs Shortest Paths
Partition the matrix into four blocks: W = [A B; C D]
A = A*;            % recursive call
B = A*B;  C = C*A;
D = D + C*B;
D = D*;            % recursive call
B = B*D;  C = D*C;
A = A + B*C;
Based on R-Kleene algorithm
Well suited for the GPU architecture:
- Fast matrix-multiply kernel
- In-place computation => low memory bandwidth
- Few, large MatMul calls => low GPU dispatch overhead
- Recursion stack on the host CPU, not on the GPU
- Careful tuning of GPU code
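The R-Kleene recursion can be sketched in NumPy over the (min, +) semiring; the `minplus` helper and the dense blocking below are illustrative stand-ins for the tuned GPU matrix-multiply kernel described in the talk, not that implementation:

```python
import numpy as np

def minplus(A, B):
    """(min, +) semiring 'matrix multiply': C[i,j] = min_k A[i,k] + B[k,j]."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def kleene(W):
    """Closure W* under (min, +), i.e. all-pairs shortest path costs."""
    n = W.shape[0]
    if n == 1:
        return np.minimum(W, 0.0)      # the empty path has cost 0
    h = n // 2                          # split W into [A B; C D]
    A, B = W[:h, :h].copy(), W[:h, h:].copy()
    C, D = W[h:, :h].copy(), W[h:, h:].copy()
    A = kleene(A)                       # recursive call
    B = minplus(A, B); C = minplus(C, A)
    D = np.minimum(D, minplus(C, B))
    D = kleene(D)                       # recursive call
    B = minplus(B, D); C = minplus(D, C)
    A = np.minimum(A, minplus(B, C))
    return np.block([[A, B], [C, D]])
```

Note how elementwise `min` plays the role of `+` and `minplus` the role of `*` in the block formulas on the slide; on the GPU the two recursive calls bottom out into large matrix-multiply calls instead of scalar base cases.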
The Case for Primitives
14
APSP: Experiments and observations
[Chart: lifting Floyd-Warshall to the GPU naively vs. with the right primitive: 480x speedup]
Conclusions:
- High performance is achievable but not simple
- Carefully chosen and optimized primitives will be key
- Unorthodox R-Kleene algorithm
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
15
16
Sparse Adjacency Matrix and Graph
Every graph is a sparse matrix and vice-versa
Adjacency matrix: sparse array w/ nonzeros for graph edges
Storage-efficient implementation from sparse data structures
[Figure: a 7-vertex directed graph and its sparse adjacency matrix transpose AT, with x marking the nonzeros]
The Case for Sparse Matrices
Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.
17
Traditional graph computations | Graphs in the language of linear algebra
Data driven, unpredictable communication | Fixed communication patterns
Irregular and unstructured, poor locality of reference | Operations on matrix blocks, exploits the memory hierarchy
Fine-grained data accesses, dominated by latency | Coarse-grained parallelism, bandwidth limited
Identification of Primitives
18
Linear Algebraic Primitives
- Sparse matrix-matrix multiplication (SpGEMM)
- Sparse matrix-vector multiplication
- Sparse matrix indexing
- Element-wise operations
Matrices over semirings, e.g. (x, +), (and, or), (+, min)
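A single kernel can serve all of these semirings by taking the add and multiply operations as parameters. A minimal dense sketch; the function name and calling convention are mine, not the Combinatorial BLAS API:

```python
def sr_matvec(A, x, add, mul, zero):
    """y = A x over a semiring (add, mul) whose additive identity is `zero`.
    A is a dense list-of-lists; real kernels would of course be sparse."""
    y = []
    for row in A:
        acc = zero
        for a_ij, x_j in zip(row, x):
            acc = add(acc, mul(a_ij, x_j))
        y.append(acc)
    return y
```

With `add=or, mul=and, zero=False` this computes one step of boolean reachability; with `add=min, mul=+, zero=inf` it relaxes shortest-path distances; with ordinary `+` and `*` it is the usual matvec.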
Why focus on SpGEMM?
Graph clustering (MCL, peer pressure)
Subgraph / submatrix indexing
Shortest path calculations
Betweenness centrality
Graph contraction
Cycle detection
Multigrid interpolation & restriction
Colored intersection searching
Applying constraints in finite element computations
Context-free parsing ...
19
Applications of Sparse GEMM
Loop until convergence:
A = A * A;            % expand
A = A .^ 2;           % inflate
sc = 1 ./ sum(A, 1);
A = A * diag(sc);     % renormalize
A(A < 0.0001) = 0;    % prune entries
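The MATLAB loop translates almost line for line to NumPy. A sketch of one MCL-style iteration; dense arrays and the prune threshold are illustrative, and it assumes every column keeps a nonzero sum:

```python
import numpy as np

def mcl_step(A, prune=1e-4):
    """One expansion/inflation step of the MCL-style loop from the slide."""
    A = A @ A                  # expand: the SpGEMM is the dominant cost
    A = A ** 2                 # inflate (elementwise square)
    A = A / A.sum(axis=0)      # renormalize each column to sum to 1
    A[A < prune] = 0.0         # prune tiny entries to keep A sparse
    return A
```

Iterating `mcl_step` until the matrix stops changing drives each column's mass toward an attractor, which is how the clustering emerges; the expansion step is exactly why SpGEMM is the primitive worth optimizing.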
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
20
21
Two Versions of Sparse GEMM
[Figure: A, B, and C each split into eight block columns A1..A8, B1..B8, C1..C8]
1D block-column distribution: processor i computes Ci = Ci + A * Bi
Checkerboard (2D block) distribution: processor (i, j) computes Cij += Aik * Bkj over stages k
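The checkerboard schedule Cij += Aik Bkj can be simulated serially. A sketch with dense NumPy blocks standing in for the sparse submatrices; the grid sizes `pr`, `pc` and the assumption that they divide n evenly are mine:

```python
import numpy as np

def spgemm_2d(A, B, pr, pc):
    """Serially simulate the checkerboard (2D) SpGEMM schedule: the (i, j)
    'processor' accumulates C[i][j] += A[i][k] @ B[k][j] over stages k."""
    n = A.shape[0]
    bs = n // pr                      # block size; assumes pr == pc divides n
    blk = lambda M, i, j: M[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
    C = np.zeros_like(A)
    for i in range(pr):
        for j in range(pc):
            for k in range(pr):       # the stages of the 2D algorithm
                blk(C, i, j)[:] += blk(A, i, k) @ blk(B, k, j)
    return C
```

In the real distributed algorithm the k-loop becomes communication stages (each processor receives Aik and Bkj blocks from its row and column), which is why each processor touches only O(sqrt(p)) others instead of all p as in the 1D variant.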
Comparative Speedup of Sparse 1D & 2D
22
In practice, 2D algorithms have the potential to scale, but not linearly
Projected performance of sparse 1D & 2D
[Figure: parallel-efficiency model of the 1D and 2D algorithms as the matrix dimension N and the processor count P grow]
23
Compressed Sparse Columns (CSC): A Standard Layout
An n×n matrix with nnz nonzeros:
- Stores entries in column-major order
- A dense collection of sparse columns
- Uses O(n + nnz) storage
[Figure: example CSC layout with a column-pointer array and a row-index array ("rowind")]
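To make the layout concrete, here is a sparse matrix-vector product that walks the two CSC arrays directly; the 2x2 example values in the usage are mine, not the slide's:

```python
def csc_spmv(n, colptr, rowind, val, x):
    """y = A x with A in compressed sparse columns: column j's nonzeros
    occupy positions colptr[j] .. colptr[j+1]-1 of rowind and val."""
    y = [0.0] * n
    for j in range(n):                         # one pass per column
        for p in range(colptr[j], colptr[j + 1]):
            y[rowind[p]] += val[p] * x[j]      # scatter into the result
    return y
```

For A = [[1, 2], [0, 3]] the CSC arrays are `colptr = [0, 1, 3]`, `rowind = [0, 0, 1]`, `val = [1, 2, 3]`, and `csc_spmv(2, colptr, rowind, val, [1, 1])` returns `[3.0, 3.0]`. Note the `O(n)` colptr array: harmless here, but the source of the waste on the next slide.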
24
Node Level Considerations
Submatrices are hypersparse (i.e. nnz << n)
[Figure: the matrix divided among p processors into √p-by-√p blocks]
Total storage: O(n·√p + nnz) if every block is stored in CSC
Average of c nonzeros per column
A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices
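The dimension-dependence can be removed by storing only the nonempty columns. A toy sketch of this idea; it is *not* the actual doubly compressed (DCSC) data structure, just an illustration of why O(nnz) storage is achievable:

```python
def to_hypersparse(n, colptr, rowind, val):
    """Re-store a CSC matrix as (column index, row indices, values) triples
    for the nonempty columns only: O(nnz) space, independent of n."""
    cols = []
    for j in range(n):
        lo, hi = colptr[j], colptr[j + 1]
        if hi > lo:                    # skip the (many) empty columns
            cols.append((j, rowind[lo:hi], val[lo:hi]))
    return cols
```

For a hypersparse block with nnz << n, the list above has at most nnz entries, while the CSC colptr array alone costs n + 1 words per block; summed over p blocks that is the O(n·√p) term the slide objects to.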
Sequential Kernel
- Strictly O(nnz) data structure
- Outer-product formulation
- Work-efficient
25
Sequential Hypersparse Kernel
[Figure: relative sizes of flops, nnz, and n for a hypersparse submatrix]
Standard kernel: O(flops + nnz(B) + n + m) time
New hypersparse kernel: O(flops · lg(ni) + nzc(A) + nzr(B)) time, where ni is the number of indices i for which both A(:, i) and B(i, :) are nonempty
26
Scaling Results for SpGEMM
RMat × RMat product (graphs with high variance in vertex degree)
Random permutations are useful for the overall computation (<10% imbalance).
Bulk-synchronous algorithms may still suffer from imbalance within the stages, but this goes away as matrices get larger.
[Chart: speedup vs. number of processors (up to 1600), for inputs SCALE 21 (16-way), SCALE 22 (16-way), SCALE 22 (4-way), and SCALE 23 (16-way)]
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
27
28
Software design of the Combinatorial BLAS
Generality: over the numeric type of matrix elements, the algebraic operation performed, and the library interface, without the language abstraction penalty: C++ templates.
- Mixed-precision arithmetic: type traits
- Enforcing the interface and strong type checking: CRTP
- General semiring operations: function objects

template <class IT, class NT, class DER>
class SpMat;

The abstraction penalty is not just a programming-language issue.
In particular, view matrices as indexed data structures and stay away from single-element access (the interface should discourage it).
29
Software design of the Combinatorial BLAS
Extendability: of the library, while maintaining compatibility and seamless upgrades.
➡ Decouple the parallel logic from the sequential part.
[Diagram: sequential storage options (Tuples, CSC, DCSC) beneath a common SpSeq layer; SpPar<Comm, SpSeq> wraps any SpSeq]
Commonalities of the sequential layers:
- Support the sequential API
- Composed of a number of arrays
- Each *may* support an iterator concept
Any parallel logic can sit on top: asynchronous, bulk synchronous, etc.
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
30
Applications and Algorithms
31
Betweenness Centrality (BC)
CB(v): among all the shortest paths, what fraction of them pass through the node of interest?
Brandes' algorithm
A typical software stack for an application enabled with the Combinatorial BLAS
Social Network Analysis
32
Betweenness Centrality using Sparse GEMM
[Figure: frontier expansion on a 7-vertex example graph: the new frontier is (AT X) .* ¬X]
Parallel breadth-first search is implemented with sparse matrix-matrix multiplication
Work-efficient algorithm for BC
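The masked frontier expansion X ← (AT X) .* ¬X has a direct set-based reading. A sketch using adjacency lists in place of the sparse matrix; the encoding and function name are mine:

```python
def bfs_levels(adj, source):
    """Breadth-first search phrased as a repeated masked 'matvec': the next
    frontier is (A^T x) restricted to unvisited vertices, as on the slide.
    adj[u] lists the out-neighbors of u."""
    n = len(adj)
    level = [-1] * n                   # -1 marks unvisited (the ¬X mask)
    frontier = {source}
    level[source] = 0
    d = 0
    while frontier:
        d += 1
        # "A^T x": the union of out-neighbors of the current frontier ...
        nxt = {v for u in frontier for v in adj[u]}
        # "... .* ¬X": keep only the vertices not yet visited
        frontier = {v for v in nxt if level[v] == -1}
        for v in frontier:
            level[v] = d
    return level
```

Batching many such searches (one column of X per source vertex) turns the matvec into a sparse matrix-matrix multiplication, which is how the BC implementation in the talk gets its parallelism.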
33
BC Performance on Distributed Memory
- TEPS: Traversed Edges Per Second
- Batch of 512 vertices at each iteration
- Code only a few lines longer than the MATLAB version
[Chart: BC performance, TEPS score (millions) vs. number of cores (25 to 484), for RMAT inputs of scale 17, 18, 19, and 20]
Input: RMAT graph of scale N: 2^N vertices, average degree 8
Pure MPI-1 version. No reliance on any particular hardware.
34
More scaling and sensitivity (Betweenness Centrality)
Input is an RMat graph of scale 22
[Chart: sensitivity to batch processing, millions of TEPS vs. number of processing cores (196 to 900), for batch sizes 128, 256, 512, and 1024]
[Chart: scaling beyond hundreds of cores, millions of TEPS vs. number of processing cores (196 to 1216), for inputs of scale 21, 22, 23, and 24; experiments use a batch of 512 nodes]
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
35
36
SpMV on Multicore
Our parallel algorithms for y ← ATx using the new compressed sparse blocks (CSB) layout have Θ(nnz) work and O(√n · lg n) span, yielding Θ(nnz / (√n · lg n)) parallelism.
[Chart: MFlops/sec vs. number of processors (1 to 8), comparing our CSB algorithms, Star-P (CSR + block-row distribution), and serial (naive CSR)]
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
37
38
Future Directions
‣Novel scalable algorithms
‣ Static graphs are just the beginning: dynamic graphs, hypergraphs, tensors
‣ Architectures (mainly nodes) are evolving:
Heterogeneous multicores
Homogeneous multicores with more cores per node
TACC Lonestar (2006): 4 cores / node
TACC Ranger (2008): 16 cores / node
SDSC Triton (2009): 32 cores / node
XYZ Resource (2020): ?
Hierarchical parallelism
39
‣What did I learn about parallel graph computations?
‣No silver bullets.
‣Graph computations are not homogenous.
‣No free lunch.
‣ If communication >> computation for your algorithm, overlapping them will NOT make it scale.
‣ Scale your algorithm, not your implementation.
‣ Any sustainable speedup is a success!
40
Related Publications
Hypersparsity in 2D decomposition, sequential kernel:
B., Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IPDPS 2008.
Parallel analysis of sparse GEMM, synchronous implementation:
B., Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", ICPP 2008.
The case for primitives, APSP on the GPU:
B., Gilbert, Budak, "Solving Path Problems on the GPU", Parallel Computing, 2009.
SpMV on multicores:
B., Fineman, Frigo, Gilbert, Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks", SPAA 2009.
Betweenness centrality results:
B., Gilbert.
Software design of the library:
B., Gilbert.
41
Acknowledgments: David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, Matteo Frigo, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Kamesh Madduri, Steve Reinhardt, Eric Robinson, Viral Shah, Sivan Toledo