TRANSCRIPT
1
Linear Algebraic Primitives for Parallel Computing on Large Graphs
Aydın Buluç, Ph.D. Defense, March 16, 2010
Committee: John R. Gilbert (Chair), Ömer Eğecioğlu, Fred Chong, Shivkumar Chandrasekaran
Sources of Massive Graphs
(WWW snapshot, courtesy Y. Hyun) (Yeast protein interaction network, courtesy H. Jeong)
Graphs naturally arise from the internet and social interactions
Many scientific (biological, chemical, cosmological, ecological, etc.) datasets are modeled as graphs.
3
Types of Graph Computations
Tightly coupled. Tool: graph traversal. Examples: centrality, shortest paths, network flows, strongly connected components.
Filtering based. Tool: Map/Reduce. Examples: loop and multi-edge removal, triangle/rectangle enumeration.
Fuzzy intersection. Examples: clustering, algebraic multigrid.
[Figures: a 7-vertex example graph; a map/reduce dataflow diagram]
Tightly Coupled Computations
A. Computationally intensive graph/data mining algorithms.
(e.g. graph clustering, centrality, dimensionality reduction)
B. Inherently latency-bound computations.
(e.g. finding s-t connectivity)
4
Huge graphs + expensive kernels ⇒ high performance and massive parallelism needed
Tightly Coupled Computations
Let the special-purpose architectures handle (B); what can we do for (A) with commodity architectures?
We need memory! Disk is not an option!
5
Software for Graph Computation
"[The most important lesson I learned from] spending ten years of my life on the TeX project is that software is hard. It's harder than anything else I've ever had to do." (Donald Knuth)
Dealing with software is hard!
High performance computing (HPC) software is harder!
How do we deal with parallel HPC software?
9
Parallel Graph Software
Where shared memory scales, it pays off to target productivity.
Distributed memory should target p > 100 at least.
Shared and distributed memory complement each other.
Coarse-grained parallelism for distributed memory.
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
10
11
All-Pairs Shortest Paths
Goal: find least-cost paths between all reachable vertex pairs
Classical algorithm: Floyd-Warshall
Case study of implementation on a multicore architecture: the graphics processing unit (GPU)
for k = 1:n   % the induction sequence
  for i = 1:n
    for j = 1:n
      if w(i,k) + w(k,j) < w(i,j)
        w(i,j) := w(i,k) + w(k,j)
[Figure: 5-vertex example graph and its distance matrix, k = 1 case]
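As a runnable illustration of the triple loop above, here is a plain-Python version; the 4-vertex example graph in the usage below is my own, not from the slides:

```python
from math import inf

def floyd_warshall(w):
    """In-place Floyd-Warshall: w[i][j] becomes the least path cost i -> j.
    w is a dense n-by-n list of lists, with inf marking absent edges."""
    n = len(w)
    for k in range(n):          # the induction sequence
        for i in range(n):
            for j in range(n):
                if w[i][k] + w[k][j] < w[i][j]:
                    w[i][j] = w[i][k] + w[k][j]
    return w
```

Usage: `floyd_warshall([[0, 1, inf, 10], [inf, 0, 1, inf], [inf, inf, 0, 1], [inf, inf, inf, 0]])` leaves cost 3 in entry (0, 3), the path 0 -> 1 -> 2 -> 3. The work is Θ(n³) regardless of sparsity, which is what makes the blocked reformulation on the next slides worthwhile.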
12
GPU characteristics
Powerful: two Nvidia 8800s > 1 TFLOPS
Inexpensive: $500 each
But:
- Difficult programming model: one instruction stream drives 8 arithmetic units
- Performance is counterintuitive and fragile: memory access pattern has subtle effects on cost
- Extremely easy to underutilize the device: doing it wrong easily costs 100x in time
[Figure: SIMD diagram, one instruction stream driving threads t1-t16 in lockstep]
13
Recursive All-Pairs Shortest Paths
Partition the matrix into four blocks: W = [A B; C D]
A = A*;            % recursive call
B = A*B;  C = C*A;
D = D + C*B;
D = D*;            % recursive call
B = B*D;  C = D*C;
A = A + B*C;
Based on R-Kleene algorithm
Well suited for the GPU architecture:
- Fast matrix-multiply kernel
- In-place computation => low memory bandwidth
- Few, large MatMul calls => low GPU dispatch overhead
- Recursion stack on the host CPU, not on the GPU
- Careful tuning of GPU code
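The R-Kleene recursion can be sketched in NumPy over the (min, +) semiring; the `minplus` helper and the dense blocking below are illustrative stand-ins for the tuned GPU matrix-multiply kernel described in the talk, not that implementation:

```python
import numpy as np

def minplus(A, B):
    """(min, +) semiring 'matrix multiply': C[i,j] = min_k A[i,k] + B[k,j]."""
    return np.min(A[:, :, None] + B[None, :, :], axis=1)

def kleene(W):
    """Closure W* under (min, +), i.e. all-pairs shortest path costs."""
    n = W.shape[0]
    if n == 1:
        return np.minimum(W, 0.0)      # the empty path has cost 0
    h = n // 2                          # split W into [A B; C D]
    A, B = W[:h, :h].copy(), W[:h, h:].copy()
    C, D = W[h:, :h].copy(), W[h:, h:].copy()
    A = kleene(A)                       # recursive call
    B = minplus(A, B); C = minplus(C, A)
    D = np.minimum(D, minplus(C, B))
    D = kleene(D)                       # recursive call
    B = minplus(B, D); C = minplus(D, C)
    A = np.minimum(A, minplus(B, C))
    return np.block([[A, B], [C, D]])
```

Note how elementwise `min` plays the role of `+` and `minplus` the role of `*` in the block formulas on the slide; on the GPU the two recursive calls bottom out into large matrix-multiply calls instead of scalar base cases.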
The Case for Primitives
14
APSP: Experiments and observations
[Chart: lifting Floyd-Warshall to the GPU naively vs. with the right primitive: 480x speedup]
Conclusions:
- High performance is achievable but not simple
- Carefully chosen and optimized primitives will be key
- Unorthodox R-Kleene algorithm
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
15
16
Sparse Adjacency Matrix and Graph
Every graph is a sparse matrix and vice-versa
Adjacency matrix: sparse array w/ nonzeros for graph edges
Storage-efficient implementation from sparse data structures
[Figure: a 7-vertex directed graph and its sparse adjacency matrix transpose AT, with x marking the nonzeros]
The Case for Sparse Matrices
Many irregular applications contain sufficient coarse-grained parallelism that can ONLY be exploited using abstractions at the proper level.
17
Traditional graph computations | Graphs in the language of linear algebra
Data driven, unpredictable communication | Fixed communication patterns
Irregular and unstructured, poor locality of reference | Operations on matrix blocks, exploits the memory hierarchy
Fine-grained data accesses, dominated by latency | Coarse-grained parallelism, bandwidth limited
Identification of Primitives
18
Linear Algebraic Primitives
- Sparse matrix-matrix multiplication (SpGEMM)
- Sparse matrix-vector multiplication
- Sparse matrix indexing
- Element-wise operations
Matrices over semirings, e.g. (x, +), (and, or), (+, min)
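A single kernel can serve all of these semirings by taking the add and multiply operations as parameters. A minimal dense sketch; the function name and calling convention are mine, not the Combinatorial BLAS API:

```python
def sr_matvec(A, x, add, mul, zero):
    """y = A x over a semiring (add, mul) whose additive identity is `zero`.
    A is a dense list-of-lists; real kernels would of course be sparse."""
    y = []
    for row in A:
        acc = zero
        for a_ij, x_j in zip(row, x):
            acc = add(acc, mul(a_ij, x_j))
        y.append(acc)
    return y
```

With `add=or, mul=and, zero=False` this computes one step of boolean reachability; with `add=min, mul=+, zero=inf` it relaxes shortest-path distances; with ordinary `+` and `*` it is the usual matvec.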
Why focus on SpGEMM?
Graph clustering (MCL, peer pressure)
Subgraph / submatrix indexing
Shortest path calculations
Betweenness centrality
Graph contraction
Cycle detection
Multigrid interpolation & restriction
Colored intersection searching
Applying constraints in finite element computations
Context-free parsing ...
19
Applications of Sparse GEMM
Loop until convergence:
A = A * A;            % expand
A = A .^ 2;           % inflate
sc = 1 ./ sum(A, 1);
A = A * diag(sc);     % renormalize
A(A < 0.0001) = 0;    % prune entries
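The MATLAB loop translates almost line for line to NumPy. A sketch of one MCL-style iteration; dense arrays and the prune threshold are illustrative, and it assumes every column keeps a nonzero sum:

```python
import numpy as np

def mcl_step(A, prune=1e-4):
    """One expansion/inflation step of the MCL-style loop from the slide."""
    A = A @ A                  # expand: the SpGEMM is the dominant cost
    A = A ** 2                 # inflate (elementwise square)
    A = A / A.sum(axis=0)      # renormalize each column to sum to 1
    A[A < prune] = 0.0         # prune tiny entries to keep A sparse
    return A
```

Iterating `mcl_step` until the matrix stops changing drives each column's mass toward an attractor, which is how the clustering emerges; the expansion step is exactly why SpGEMM is the primitive worth optimizing.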
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
20
21
Two Versions of Sparse GEMM
[Figure: A, B, and C each split into eight block columns A1..A8, B1..B8, C1..C8]
1D block-column distribution: processor i computes Ci = Ci + A * Bi
Checkerboard (2D block) distribution: processor (i, j) computes Cij += Aik * Bkj over stages k
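The checkerboard schedule Cij += Aik Bkj can be simulated serially. A sketch with dense NumPy blocks standing in for the sparse submatrices; the grid sizes `pr`, `pc` and the assumption that they divide n evenly are mine:

```python
import numpy as np

def spgemm_2d(A, B, pr, pc):
    """Serially simulate the checkerboard (2D) SpGEMM schedule: the (i, j)
    'processor' accumulates C[i][j] += A[i][k] @ B[k][j] over stages k."""
    n = A.shape[0]
    bs = n // pr                      # block size; assumes pr == pc divides n
    blk = lambda M, i, j: M[i * bs:(i + 1) * bs, j * bs:(j + 1) * bs]
    C = np.zeros_like(A)
    for i in range(pr):
        for j in range(pc):
            for k in range(pr):       # the stages of the 2D algorithm
                blk(C, i, j)[:] += blk(A, i, k) @ blk(B, k, j)
    return C
```

In the real distributed algorithm the k-loop becomes communication stages (each processor receives Aik and Bkj blocks from its row and column), which is why each processor touches only O(sqrt(p)) others instead of all p as in the 1D variant.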
Comparative Speedup of Sparse 1D & 2D
22
In practice, 2D algorithms have the potential to scale, but not linearly
Projected performance of sparse 1D & 2D
[Figure: parallel-efficiency model of the 1D and 2D algorithms as the matrix dimension N and the processor count P grow]
23
Compressed Sparse Columns (CSC): A Standard Layout
An n×n matrix with nnz nonzeros:
- Stores entries in column-major order
- A dense collection of sparse columns
- Uses O(n + nnz) storage
[Figure: example CSC layout with a column-pointer array and a row-index array ("rowind")]
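To make the layout concrete, here is a sparse matrix-vector product that walks the two CSC arrays directly; the 2x2 example values in the usage are mine, not the slide's:

```python
def csc_spmv(n, colptr, rowind, val, x):
    """y = A x with A in compressed sparse columns: column j's nonzeros
    occupy positions colptr[j] .. colptr[j+1]-1 of rowind and val."""
    y = [0.0] * n
    for j in range(n):                         # one pass per column
        for p in range(colptr[j], colptr[j + 1]):
            y[rowind[p]] += val[p] * x[j]      # scatter into the result
    return y
```

For A = [[1, 2], [0, 3]] the CSC arrays are `colptr = [0, 1, 3]`, `rowind = [0, 0, 1]`, `val = [1, 2, 3]`, and `csc_spmv(2, colptr, rowind, val, [1, 1])` returns `[3.0, 3.0]`. Note the `O(n)` colptr array: harmless here, but the source of the waste on the next slide.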
24
Node Level Considerations
Submatrices are hypersparse (i.e. nnz << n)
[Figure: the matrix divided among p processors into √p-by-√p blocks]
Total storage: O(n·√p + nnz) if every block is stored in CSC
Average of c nonzeros per column
A data structure or algorithm that depends on the matrix dimension n (e.g. CSR or CSC) is asymptotically too wasteful for submatrices
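The dimension-dependence can be removed by storing only the nonempty columns. A toy sketch of this idea; it is *not* the actual doubly compressed (DCSC) data structure, just an illustration of why O(nnz) storage is achievable:

```python
def to_hypersparse(n, colptr, rowind, val):
    """Re-store a CSC matrix as (column index, row indices, values) triples
    for the nonempty columns only: O(nnz) space, independent of n."""
    cols = []
    for j in range(n):
        lo, hi = colptr[j], colptr[j + 1]
        if hi > lo:                    # skip the (many) empty columns
            cols.append((j, rowind[lo:hi], val[lo:hi]))
    return cols
```

For a hypersparse block with nnz << n, the list above has at most nnz entries, while the CSC colptr array alone costs n + 1 words per block; summed over p blocks that is the O(n·√p) term the slide objects to.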
Sequential Kernel
- Strictly O(nnz) data structure
- Outer-product formulation
- Work-efficient
25
Sequential Hypersparse Kernel
[Figure: relative sizes of flops, nnz, and n for a hypersparse submatrix]
Standard kernel: O(flops + nnz(B) + n + m) time
New hypersparse kernel: O(flops · lg(ni) + nzc(A) + nzr(B)) time, where ni is the number of indices i for which both A(:, i) and B(i, :) are nonempty
26
Scaling Results for SpGEMM
RMat × RMat product (graphs with high variance in vertex degree)
Random permutations are useful for the overall computation (<10% imbalance).
Bulk-synchronous algorithms may still suffer from imbalance within the stages, but this goes away as matrices get larger.
[Chart: speedup vs. number of processors (up to 1600), for inputs SCALE 21 (16-way), SCALE 22 (16-way), SCALE 22 (4-way), and SCALE 23 (16-way)]
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
27
28
Software design of the Combinatorial BLAS
Generality: over the numeric type of matrix elements, the algebraic operation performed, and the library interface, without the language abstraction penalty: C++ templates.
- Mixed-precision arithmetic: type traits
- Enforcing the interface and strong type checking: CRTP
- General semiring operations: function objects

template <class IT, class NT, class DER>
class SpMat;

The abstraction penalty is not just a programming-language issue.
In particular, view matrices as indexed data structures and stay away from single-element access (the interface should discourage it).
29
Software design of the Combinatorial BLAS
Extendability: of the library, while maintaining compatibility and seamless upgrades.
➡ Decouple the parallel logic from the sequential part.
[Diagram: sequential storage options (Tuples, CSC, DCSC) beneath a common SpSeq layer; SpPar<Comm, SpSeq> wraps any SpSeq]
Commonalities of the sequential layers:
- Support the sequential API
- Composed of a number of arrays
- Each *may* support an iterator concept
Any parallel logic can sit on top: asynchronous, bulk synchronous, etc.
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
30
Applications and Algorithms
31
Betweenness Centrality (BC)
CB(v): among all the shortest paths, what fraction of them pass through the node of interest?
Brandes' algorithm
A typical software stack for an application enabled with the Combinatorial BLAS
Social Network Analysis
32
Betweenness Centrality using Sparse GEMM
[Figure: frontier expansion on a 7-vertex example graph: the new frontier is (AT X) .* ¬X]
Parallel breadth-first search is implemented with sparse matrix-matrix multiplication
Work-efficient algorithm for BC
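The masked frontier expansion X ← (AT X) .* ¬X has a direct set-based reading. A sketch using adjacency lists in place of the sparse matrix; the encoding and function name are mine:

```python
def bfs_levels(adj, source):
    """Breadth-first search phrased as a repeated masked 'matvec': the next
    frontier is (A^T x) restricted to unvisited vertices, as on the slide.
    adj[u] lists the out-neighbors of u."""
    n = len(adj)
    level = [-1] * n                   # -1 marks unvisited (the ¬X mask)
    frontier = {source}
    level[source] = 0
    d = 0
    while frontier:
        d += 1
        # "A^T x": the union of out-neighbors of the current frontier ...
        nxt = {v for u in frontier for v in adj[u]}
        # "... .* ¬X": keep only the vertices not yet visited
        frontier = {v for v in nxt if level[v] == -1}
        for v in frontier:
            level[v] = d
    return level
```

Batching many such searches (one column of X per source vertex) turns the matvec into a sparse matrix-matrix multiplication, which is how the BC implementation in the talk gets its parallelism.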
33
BC Performance on Distributed Memory
- TEPS: Traversed Edges Per Second
- Batch of 512 vertices at each iteration
- Code only a few lines longer than the MATLAB version
[Chart: BC performance, TEPS score (millions) vs. number of cores (25 to 484), for RMAT inputs of scale 17, 18, 19, and 20]
Input: RMAT graph of scale N: 2^N vertices, average degree 8
Pure MPI-1 version. No reliance on any particular hardware.
34
More scaling and sensitivity (Betweenness Centrality)
Input is an RMat graph of scale 22
[Chart: sensitivity to batch processing, millions of TEPS vs. number of processing cores (196 to 900), for batch sizes 128, 256, 512, and 1024]
[Chart: scaling beyond hundreds of cores, millions of TEPS vs. number of processing cores (196 to 1216), for inputs of scale 21, 22, 23, and 24; experiments use a batch of 512 nodes]
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
35
36
SpMV on Multicore
Our parallel algorithms for y ← ATx using the new compressed sparse blocks (CSB) layout have Θ(nnz) work and O(√n · lg n) span, yielding Θ(nnz / (√n · lg n)) parallelism.
[Chart: MFlops/sec vs. number of processors (1 to 8), comparing our CSB algorithms, Star-P (CSR + block-row distribution), and serial (naive CSR)]
Outline
The Case for Primitives
The Case for Sparse Matrices
Parallel Sparse Matrix-Matrix Multiplication
Software Design of the Combinatorial BLAS
An Application in Social Network Analysis
Other Work
Future Directions
37
38
Future Directions
‣Novel scalable algorithms
‣ Static graphs are just the beginning: dynamic graphs, hypergraphs, tensors
‣ Architectures (mainly nodes) are evolving:
Heterogeneous multicores
Homogeneous multicores with more cores per node
TACC Lonestar (2006): 4 cores / node
TACC Ranger (2008): 16 cores / node
SDSC Triton (2009): 32 cores / node
XYZ Resource (2020): ?
Hierarchical parallelism
39
‣What did I learn about parallel graph computations?
‣No silver bullets.
‣Graph computations are not homogenous.
‣No free lunch.
‣ If communication >> computation for your algorithm, overlapping them will NOT make it scale.
‣ Scale your algorithm, not your implementation.
‣ Any sustainable speedup is a success!
40
Related Publications
Hypersparsity in 2D decomposition, sequential kernel:
B., Gilbert, "On the Representation and Multiplication of Hypersparse Matrices", IPDPS 2008.
Parallel analysis of sparse GEMM, synchronous implementation:
B., Gilbert, "Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication", ICPP 2008.
The case for primitives, APSP on the GPU:
B., Gilbert, Budak, "Solving Path Problems on the GPU", Parallel Computing, 2009.
SpMV on multicores:
B., Fineman, Frigo, Gilbert, Leiserson, "Parallel Sparse Matrix-Vector and Matrix-Transpose-Vector Multiplication Using Compressed Sparse Blocks", SPAA 2009.
Betweenness centrality results:
B., Gilbert.
Software design of the library:
B., Gilbert.
41
Acknowledgments: David Bader, Erik Boman, Ceren Budak, Alan Edelman, Jeremy Fineman, Matteo Frigo, Bruce Hendrickson, Jeremy Kepner, Charles Leiserson, Kamesh Madduri, Steve Reinhardt, Eric Robinson, Viral Shah, Sivan Toledo