Parallel Combinatorial Computing and Sparse Matrices

Parallel Combinatorial Computing and Sparse Matrices. E. Jason Riedy and James Demmel, Computer Science Division, EECS Department, University of California, Berkeley. SIAM Conference on Computational Science and Engineering, 2005.


DESCRIPTION

Re-defining scalability for irregular problems, starting with bipartite matching.

TRANSCRIPT

Page 1: Parallel Combinatorial Computing and Sparse Matrices

Parallel Combinatorial Computing and Sparse Matrices

E. Jason Riedy James Demmel

Computer Science Division, EECS Department

University of California, Berkeley

SIAM Conference on Computational Science and Engineering, 2005

Page 2: Parallel Combinatorial Computing and Sparse Matrices

Outline

Fundamental question: What performance metrics are right?

Background

Algorithms
  - Sparse Transpose
  - Weighted Bipartite Matching

Setting Performance Goals

Ordering Ideas

Page 3: Parallel Combinatorial Computing and Sparse Matrices

App: Sparse LU Factorization

Characteristics:
- Large quantities of numerical work.
- Eats memory and flops.
- Benefits from parallel work.
- And needs combinatorial support.


Page 5: Parallel Combinatorial Computing and Sparse Matrices

App: Sparse LU Factorization

Combinatorial support:

- Fill-reducing orderings, pivot avoidance, data structures.
- Numerical work is distributed.
- Supporting algorithms need to be distributed.
- Memory may be cheap ($100 GB), moving data is costly.

Page 6: Parallel Combinatorial Computing and Sparse Matrices

Sparse Transpose

Data structure manipulation

- Dense transpose moves numbers; sparse transpose moves numbers and re-indexes them.
- Sequentially space-efficient “algorithms” exist, but disagree with most processors.
  - Chains of data-dependent loads
  - Unpredictable memory patterns (see the sketch below)

If the data is already distributed, an unoptimized parallel transpose is better than an optimized sequential one!
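To make the re-indexing concrete, here is a minimal sequential sketch of a counting-sort style CSR transpose in Python/NumPy. The function name and argument layout are illustrative, not from the talk; the inner scatter loop is exactly the chain of data-dependent, unpredictable memory accesses the slide warns about.

    import numpy as np

    def csr_transpose(n_rows, n_cols, row_ptr, col_idx, values):
        """Counting-sort transpose of a CSR matrix (values is a NumPy array).

        Returns the transpose in CSR form (equivalently, A in CSC form).
        """
        nnz = len(values)
        counts = np.zeros(n_cols + 1, dtype=np.int64)
        for j in col_idx:                     # count entries per output row
            counts[j + 1] += 1
        t_row_ptr = np.cumsum(counts)

        t_col_idx = np.empty(nnz, dtype=np.int64)
        t_values = np.empty(nnz, dtype=values.dtype)
        fill = t_row_ptr[:-1].copy()          # next free slot per output row
        for i in range(n_rows):
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                dest = fill[j]                # data-dependent index: the
                t_col_idx[dest] = i           # stores scatter unpredictably
                t_values[dest] = values[k]
                fill[j] += 1
        return t_row_ptr, t_col_idx, t_values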

Page 7: Parallel Combinatorial Computing and Sparse Matrices

Parallel Sparse Transpose

[Figure: matrix block-distributed across four processors P1–P4.]

Many, many options:

- Send to root, transpose, distribute.
- Transpose, send pieces to destinations.
- Transpose, then rotate data.
- Replicate the matrix, transpose everywhere.

Sending to the root communicates most of the matrix twice, and the root node stores the whole matrix.

Note: We should compare with this implementation, not a purely sequential one.

Page 8: Parallel Combinatorial Computing and Sparse Matrices

Parallel Sparse Transpose

[Figure: matrix block-distributed across four processors P1–P4.]

Many, many options:

- Send to root, transpose, distribute.
- Transpose, send pieces to destinations.
- Transpose, then rotate data.
- Replicate the matrix, transpose everywhere.

Sending pieces to their destinations is all-to-all communication with some parallel work (a sketch of this exchange follows).
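A minimal sketch of the second option (transpose locally, then send pieces to their destinations) using mpi4py's object-based all-to-all. The triple-list representation, the block-row ownership rule, and the function names are assumptions for illustration; a real code would exchange packed CSR buffers rather than pickled tuples.

    from mpi4py import MPI

    def parallel_transpose(local_triples, n_cols, comm):
        """Exchange locally transposed nonzeros with one all-to-all.

        local_triples: list of (i, j, v) nonzeros owned by this rank.
        Rows of the transpose (columns of A) are block-distributed.
        """
        size = comm.Get_size()
        block = (n_cols + size - 1) // size          # block-row ownership rule
        owner = lambda j: j // block

        outgoing = [[] for _ in range(size)]
        for (i, j, v) in local_triples:              # transpose: (i, j) -> (j, i)
            outgoing[owner(j)].append((j, i, v))

        # One collective exchange; no rank ever gathers the whole matrix.
        incoming = comm.alltoall(outgoing)
        return [t for bucket in incoming for t in bucket]

    # Usage: comm = MPI.COMM_WORLD; mine = parallel_transpose(triples, n, comm)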

Page 9: Parallel Combinatorial Computing and Sparse Matrices

Parallel Sparse Transpose

[Figure: matrix block-distributed across four processors P1–P4.]

Many, many options:

- Send to root, transpose, distribute.
- Transpose, send pieces to destinations.
- Transpose, then rotate data.
- Replicate the matrix, transpose everywhere.

Rotating the data is serial communication, but may hide latencies.

Page 10: Parallel Combinatorial Computing and Sparse Matrices

Parallel Sparse Transpose

[Figure: matrix block-distributed across four processors P1–P4.]

Many, many options:

- Send to root, transpose, distribute.
- Transpose, send pieces to destinations.
- Transpose, then rotate data.
- Replicate the matrix, transpose everywhere.

Replicating the matrix and transposing everywhere is useful in some circumstances.

Page 11: Parallel Combinatorial Computing and Sparse Matrices

What Data Is Interesting?

- Time
- How much data is communicated.
- Overhead and latency.
- Quantity of data resident on a processor.

Page 12: Parallel Combinatorial Computing and Sparse Matrices

What Data Is Interesting?

- Time (to solution)
- How much data is communicated.
- Overhead and latency.
- Quantity of data resident on a processor.

Page 13: Parallel Combinatorial Computing and Sparse Matrices

Parallel Transpose Performance

[Figure: speed-up vs. number of processors. Left panel “Transpose”: speed-up relative to sequential code (y-axis 0.0–1.5, x-axis 5–20 processors). Right panel “Transpose v. root”: re-scaled speed-up relative to the gather-to-root implementation (y-axis 0–6, x-axis 5–15 processors).]

- All-to-all is slower than pure sequential code, but distributed.
- Actual speed-up when the data is already distributed.
- Hard to keep constant size per node when performance varies by problem.

Data from CITRIS cluster, Itanium2s with Myrinet.

Page 14: Parallel Combinatorial Computing and Sparse Matrices

Weighted Bipartite Matching

- Not moving data around but finding where it should go.
- Find the “best” edges in a bipartite graph.
- Corresponds to picking the “best” diagonal.
- Used for static pivoting in factorization.
- Also in travelling salesman problems, etc.


Page 16: Parallel Combinatorial Computing and Sparse Matrices

Algorithms

Depth-first search

- Reliable performance, code available (MC64).
- Requires A and A^T.
- Difficult to compute on distributed data.

Interior point

- Performance varies wildly; many tuning parameters.
- Full generality: solve a larger sparse system.
- Auction algorithms replace the solve with iterative bidding (a minimal sketch follows).
- Easy to distribute.
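A minimal, dense-matrix sketch of the bidding idea behind auction algorithms (Bertsekas style), assuming a square benefit matrix with n >= 2. The names and the single-eps simplification are illustrative only; the distributed auction used in the experiments is more involved and works on sparse data.

    import numpy as np

    def auction_matching(benefit, eps=1e-3):
        """Assign each row i to a column j, maximizing the sum of benefit[i, j].

        With integer benefits and eps < 1/n the matching is optimal; larger
        eps trades quality for speed. Returns owner[j] = row matched to j.
        """
        n = benefit.shape[0]
        price = np.zeros(n)
        owner = -np.ones(n, dtype=int)
        unassigned = list(range(n))

        while unassigned:
            i = unassigned.pop()
            value = benefit[i] - price               # net value of each column
            j = int(np.argmax(value))
            best = value[j]
            value[j] = -np.inf
            second = value.max()
            price[j] += best - second + eps          # bid raises the price

            displaced = owner[j]
            owner[j] = i
            if displaced >= 0:                       # displaced bidder tries again
                unassigned.append(displaced)
        return owner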

Page 17: Parallel Combinatorial Computing and Sparse Matrices

Parallel Auction Performance

[Figure: speed-up vs. number of processors (5–15). Left panel “Auction”: speed-up of the parallel auction (y-axis 2–8). Right panel “Auction v. root”: re-scaled speed-up relative to running the auction on the root (y-axis 2–8).]

Compared with running an auction on the root, a parallel auction achieves a slight speed-up.

Page 18: Parallel Combinatorial Computing and Sparse Matrices

Proposed Performance Goals

When is a distributed combinatorial algorithm (or code) successful?

- Does not redistribute the data excessively.
- Keeps the data distributed.
- No process sees more than the whole.
- Performance is competitive with the on-root option.

Pure speed-up is a great goal, but not always reasonable.

Page 19: Parallel Combinatorial Computing and Sparse Matrices

Distributed Matrix Ordering

Finding a permutation of columns and rows to reduce fill.

NP-hard to solve, difficult to approximate.

Page 20: Parallel Combinatorial Computing and Sparse Matrices

Sequential Orderings

Bottom-up

- Pick columns (and possibly rows) in sequence.
- Heuristic choices:
  - Minimum degree, deficiency, approximations (a greedy minimum-degree sketch follows this slide).
- Maintain symbolic factorization.

Top-down

- Separate the matrix into multiple sections.
- Graph partitioning: A + A^T, A^T · A, or A.
- Needs vertex separators: difficult.

Top-down Hybrid

- Dissect until small, then order.
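As a point of reference for the bottom-up family above, here is a greedy minimum-degree sketch on the graph of A + A^T. It is deliberately naive: real codes such as AMD work with quotient graphs and approximate degrees instead of explicitly forming the fill edges as done here, and the function name is illustrative.

    def minimum_degree_order(adj):
        """Greedy minimum-degree ordering.

        adj: dict vertex -> set of neighbors (graph of A + A^T, no self loops).
        Eliminating a vertex connects its remaining neighbors, modeling the
        fill that pivot would create.
        """
        adj = {v: set(nbrs) for v, nbrs in adj.items()}   # private working copy
        order = []
        while adj:
            v = min(adj, key=lambda u: len(adj[u]))       # least current degree
            nbrs = adj.pop(v)
            for u in nbrs:
                adj[u].discard(v)
                adj[u].update(nbrs - {u})                 # explicit fill edges
            order.append(v)
        return order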

Page 21: Parallel Combinatorial Computing and Sparse Matrices

Parallel Bottom-up Hybrid Order

1. Separate the graph into chunks.
   - Needs an edge separator and knowledge of the pivot.
2. Order each chunk separately.
   - Forms local partial orders.
3. Merge the orders.
   - What needs to be communicated?

Page 22: Parallel Combinatorial Computing and Sparse Matrices

Merging Partial Orders

Respecting partial orders

- Local, symbolic factorization done once.
- Only need to communicate the quotient graph.
- Quotient graph: implicit edges for Schur complements.
- No node will communicate more than the whole matrix.

Page 23: Parallel Combinatorial Computing and Sparse Matrices

Preliminary Quality Results

Merging heuristic

- Pairwise merge.
- Pick the head pivot with the least worst-case fill (Markowitz cost); a small scoring sketch follows this slide.

Small (tiny) matrices: performance not reliable.

Matrix     Method          NNZ increase
west2021   AMD (A + A^T)   1.51×
           merging         1.68×
orani678   AMD             2.37×
           merging         6.11×

Increasing the numerical work drastically spends any savings from computing a distributed order. Need better heuristics?
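The scoring step of the merging heuristic might look like the following sketch; the function and argument names are hypothetical. The Markowitz cost (r_i - 1)(c_j - 1) bounds the fill a pivot at (i, j) can create, so the merge picks the head pivot that minimizes it.

    def best_head_pivot(row_nnz, col_nnz, candidates):
        """Return the candidate (i, j) with the least worst-case fill.

        row_nnz[i] and col_nnz[j] are current nonzero counts; the Markowitz
        cost (row_nnz[i] - 1) * (col_nnz[j] - 1) bounds the fill from (i, j).
        """
        return min(candidates,
                   key=lambda ij: (row_nnz[ij[0]] - 1) * (col_nnz[ij[1]] - 1))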

Page 24: Parallel Combinatorial Computing and Sparse Matrices

Summary

- Meeting classical expectations of scaling is difficult.
- Relatively small amounts of computation for much communication.
- Problem-dependent performance makes equi-size scaling hard.
- But consolidation costs when data is already distributed.

In a supporting role, don’t sweat the speed-up. Keep the problem distributed.

Open topics

- Any new ideas for parallel ordering?