Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions

David Ozog*, Jeff R. Hammond†, James Dinan†, Pavan Balaji†, Sameer Shende*, Allen Malony*

*University of Oregon †Argonne National Laboratory

2013 International Conference on Parallel Processing (ICPP), October 2, 2013


Outline

1. NWChem, Coupled Cluster, Tensor Contraction Engine
2. Load Balance Challenges
3. Dynamic Load Balancing with Global Arrays (GA)
4. Nxtval Performance Experiments
5. Inspector/Executor Design
6. Performance Modeling (DGEMM and TCE Sort)
7. Largest Processing Time (LPT) Algorithm
8. Dynamic Buckets – Design and Implementation
9. Results
10. Conclusions
11. Future Work

NWChem and Coupled Cluster

NWChem:

• Wide range of methods, accuracies, and supported supercomputer architectures
• Well-known for its support of many quantum mechanical methods on massively parallel systems.
• Built on top of Global Arrays (GA) / ARMCI

Coupled Cluster (CC):

• Ab initio (from first principles) and highly accurate
• Solves an approximate Schrödinger Equation
• Accuracy hierarchy: CCSD < CCSD(T) < CCSDT < CCSDT(Q) < CCSDTQ
• The respective computational costs: $O(n^6)$, $O(n^7)$, $O(n^8)$, $O(n^9)$, $O(n^{10})$
• And respective storage costs: $O(n^4)$, $O(n^4)$, $O(n^6)$, $O(n^6)$, $O(n^8)$

*Photos from nwchem-sw.org

[Diagram: Global Address Space over Distributed Memory Spaces. *Diagram from GA tutorial (ACTS 2009)]

DGEMM Tasks - Load Imbalance

• In CCSX (X = D, T, Q), one tensor contraction contains between one hundred and one million DGEMMs
• MFLOPs per task depend on:
  • Number of atoms
  • Spin and spatial symmetry
  • Accuracy of the chosen basis
  • The tile size
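To make the spread in task cost concrete, here is a minimal sketch that counts floating-point operations for a few DGEMM shapes. It assumes the usual estimate of about 2mnk operations for an m×k by k×n product; the tile shapes are hypothetical and not taken from the talk.

```python
def dgemm_flops(m: int, n: int, k: int) -> int:
    """Approximate floating-point operations for C = alpha*A*B + beta*C."""
    # m*n dot products of length k, each costing k multiplies and k adds.
    return 2 * m * n * k


# Hypothetical tile shapes (m, n, k); real shapes depend on the molecule,
# its symmetry, the basis set, and the chosen tile size.
tasks = [(8, 8, 8), (40, 40, 40), (40, 8, 200), (200, 200, 200)]
mflops = [dgemm_flops(m, n, k) / 1e6 for m, n, k in tasks]

for shape, cost in zip(tasks, mflops):
    print(shape, f"{cost:.6f} MFLOP")

print("largest / smallest task cost:", max(mflops) / min(mflops))
```

Even this small made-up set spans four orders of magnitude in cost, which is the kind of variation that makes a naive equal-count task split unbalanced.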


Computational Challenges

• Benzene: highly symmetric
• Water clusters: asymmetric
• Macro-molecules: QM/MM

• Load balance is crucially important for performance
• Obtaining optimal load balance is an NP-hard problem.

*Photos from nwchem-sw.org


GA Dynamic Load Balancing Template


Works best when:

• On a single node (in SysV shared memory)

• Time spent in each task, FOO(a), is large

• On high-speed interconnects

• Number of simultaneous calls is reasonably small (less than 1,000).
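The template's code did not survive in this transcript, so the following is only a sketch of the general idea behind counter-based dynamic load balancing, written with Python threads. `SharedCounter`, `next_val`, and `foo` are hypothetical stand-ins and are not the Global Arrays API; in the real setting the counter lives on one process and every increment is a remote operation.

```python
import threading


class SharedCounter:
    """Stand-in for a globally shared task counter (not the GA API)."""

    def __init__(self) -> None:
        self._value = 0
        self._lock = threading.Lock()

    def next_val(self) -> int:
        # Atomic fetch-and-increment: every caller gets a unique task id.
        with self._lock:
            value = self._value
            self._value += 1
            return value


def foo(a: int) -> int:
    """Hypothetical unit of work for task `a`."""
    return a * a


def worker(counter, num_tasks, results):
    # Each worker repeatedly claims the next unclaimed task until none remain.
    task = counter.next_val()
    while task < num_tasks:
        results[task] = foo(task)
        task = counter.next_val()


num_tasks = 100
results = [None] * num_tasks
counter = SharedCounter()
threads = [threading.Thread(target=worker, args=(counter, num_tasks, results))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(all(results[a] == foo(a) for a in range(num_tasks)))
```

Every task costs one trip to the counter, which is why the scheme favors long-running tasks and a modest number of simultaneous callers.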

Nxtval - Performance Experiments

TAU Profiling

• 14 water molecules, aug-cc-pVDZ

• 123 nodes, 8 ppn

• Nxtval consumes a large percentage of the execution time.

Flooding micro-benchmark

• Proportional time within Nxtval increases with more participating processes.

• When the arrival rate exceeds the processing rate, the process hosting the counter must use buffering and flow control.


Nxtval Performance Experiments

Strong Scaling

• 10 water molecules (aDZ)
• 14 water molecules (aDZ)

• 8 processes per node

• Percentage of overall execution time within Nxtval increases with scaling.

Inspector/Executor Design

1. Inspector
• Calculate memory requirements
• Remove null tasks
• Collate task-list

2. Task Cost Estimator
• Two options:
  • Use performance models
  • Load gettimeofday() measurement from previous iteration(s)
• Deduce performance models off-line

3. Static Partitioner
• Partition into N groups where N is the number of MPI processes
• Minimize load imbalance according to cost estimations
• Write task list information for each proc/contraction to volatile memory

4. Executor
• Launch all tasks
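The four stages above can be sketched in a few lines. This is an illustrative outline only, not the NWChem implementation: the task representation, the null-task test (a zero dimension standing in for the real symmetry checks), the flop-based cost model, and all function names are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Task = Tuple[int, int, int]  # hypothetical (m, n, k) tile dimensions


def inspect(candidates: List[Task]) -> List[Task]:
    """Inspector: drop null tasks (any zero dimension) and collate the rest."""
    return [t for t in candidates if all(dim > 0 for dim in t)]


def estimate_cost(task: Task) -> float:
    """Cost estimator: a placeholder model proportional to the flop count."""
    m, n, k = task
    return 2.0 * m * n * k


def partition(tasks: List[Task], nprocs: int,
              cost: Callable[[Task], float]) -> Dict[int, List[Task]]:
    """Static partitioner: greedily give each task to the least-loaded rank."""
    loads = [0.0] * nprocs
    plan: Dict[int, List[Task]] = {rank: [] for rank in range(nprocs)}
    for task in sorted(tasks, key=cost, reverse=True):
        rank = loads.index(min(loads))
        plan[rank].append(task)
        loads[rank] += cost(task)
    return plan


def execute(my_tasks: List[Task]) -> float:
    """Executor: run this rank's tasks; here it only sums their costs."""
    return sum(estimate_cost(t) for t in my_tasks)


candidates = [(4, 4, 4), (0, 8, 8), (16, 16, 16), (8, 8, 8), (8, 0, 2), (12, 12, 12)]
plan = partition(inspect(candidates), nprocs=2, cost=estimate_cost)
for rank, my_tasks in plan.items():
    print(rank, my_tasks, execute(my_tasks))
```

Because the plan is computed before execution, no rank touches a shared counter while running its tasks; the price is that the balance is only as good as the cost estimates.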

Performance Modeling - DGEMM

DGEMM: $C \leftarrow \alpha A B + \beta C$

• A(m,k), B(k,n), and C(m,n) are 2D matrices
• α and β are scalar coefficients

Our Performance Model:

• (mn) dot products of length k
• Corresponding (mn) store operations in C
• m loads of size k from A
• n loads of size k from B
• a, b, c, and d are found by solving a nonlinear least squares problem (in Matlab)
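The model equation itself is not preserved in this transcript. One form consistent with the four terms listed is t(m,n,k) = a·mnk + b·mn + c·mk + d·nk, and the sketch below fits that assumed form. Since this form is linear in a, b, c, and d, the sketch uses ordinary least squares through the normal equations rather than the nonlinear Matlab solve mentioned on the slide. The timings are synthetic, generated from made-up coefficients, and all function names are hypothetical.

```python
import random


def features(m, n, k):
    # One column per term of the assumed model:
    #   t = a*(m*n*k) + b*(m*n) + c*(m*k) + d*(n*k)
    return [m * n * k, m * n, m * k, n * k]


def solve(A, b):
    """Solve the small dense system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def fit(samples):
    """Least squares via the normal equations (X^T X) coef = X^T t."""
    X = [features(m, n, k) for m, n, k, _ in samples]
    t = [s[3] for s in samples]
    p = 4
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xtt = [sum(row[i] * ti for row, ti in zip(X, t)) for i in range(p)]
    return solve(XtX, Xtt)


def predict(coef, m, n, k):
    return sum(c * f for c, f in zip(coef, features(m, n, k)))


# Synthetic timings generated from made-up coefficients (not measurements).
random.seed(0)
true_coef = [2e-9, 1e-8, 5e-9, 5e-9]
samples = []
for _ in range(50):
    m, n, k = (random.randint(10, 100) for _ in range(3))
    samples.append((m, n, k, predict(true_coef, m, n, k)))

coef = fit(samples)
print(predict(coef, 64, 64, 64), predict(true_coef, 64, 64, 64))
```

With real, noisy timings the fitted coefficients would only approximate the machine's behavior, and the fit would need to be repeated per architecture.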


Performance Modeling – TCE “Sort”

Our Performance Model:

• TCE “Sorts” are actually matrix permutations

• 3rd order polynomial fit suffices

• Data always fits in L2 cache for this architecture

• Somewhat noisy measurements, but that’s OK.

[Plot: TCE sort performance measurements; axis label: (bytes)]
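As an illustration of the third-order fit mentioned above, the sketch below fits a cubic by least squares to synthetic points. The data are generated from a made-up cubic rather than measured, the 256 KiB scale used to normalize sizes is a hypothetical value, and the function names are assumptions.

```python
def solve(A, b):
    """Solve the small dense system A x = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


def polyfit3(xs, ys):
    """Least-squares cubic; returns [c0, c1, c2, c3] for c0 + c1*x + c2*x^2 + c3*x^3."""
    rows = [[x ** p for p in range(4)] for x in xs]
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    Xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(4)]
    return solve(XtX, Xty)


def predict(coeffs, x):
    # Horner's rule evaluation of the polynomial.
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result


SCALE_BYTES = 262144  # hypothetical normalization (256 KiB), keeps x in (0, 1]
true_coeffs = [5.0, 40.0, 12.0, 3.0]  # made-up cubic, not a measured model
sizes = [16384 * i for i in range(1, 17)]
xs = [s / SCALE_BYTES for s in sizes]
ys = [predict(true_coeffs, x) for x in xs]  # synthetic "timings"

coeffs = polyfit3(xs, ys)
print(predict(coeffs, 0.5))
```

Normalizing the sizes before fitting keeps the normal equations well conditioned; fitting raw byte counts directly would involve sixth powers of very large numbers.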


Largest Processing Time (LPT) Algorithm

1. Sort tasks by cost in descending order
2. Assign each task to the least-loaded process so far

• Polynomial-time algorithm applied to an NP-hard problem
• Proven “4/3 approximate” by Ronald Graham*

*SIAM Journal on Applied Mathematics, Vol. 17, No. 2 (Mar. 1969), pp. 416-429.
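The two LPT steps can be sketched directly. This version finds the least-loaded process with a plain linear scan; the cost values are hypothetical estimates chosen for illustration.

```python
def lpt(costs, nprocs):
    """Largest Processing Time first: returns (assignment, loads)."""
    loads = [0.0] * nprocs
    assignment = [[] for _ in range(nprocs)]
    # Step 1: visit tasks from most to least expensive.
    for task in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        # Step 2: give the task to the process with the smallest load so far.
        proc = min(range(nprocs), key=lambda p: loads[p])
        assignment[proc].append(task)
        loads[proc] += costs[task]
    return assignment, loads


costs = [7, 5, 4, 3, 3, 2]  # hypothetical task cost estimates
assignment, loads = lpt(costs, 2)
print(assignment, loads)
```

For this input the two processes end with equal loads, though LPT does not guarantee an optimal split in general.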


LPT - Binary Min Heap

1. Initialize a heap with N nodes (N = # of procs) each having zero cost.

2. Perform IncreaseMin() operation for each new cost from the sorted list of tasks.

• IncreaseMin() is quite efficient because UpdateRoot() often occurs in O(1) time.

• Far more efficient than the naïve approach of iterating through an array to find the min.

• Execution time for this phase is negligible.
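A minimal sketch of the heap-based variant, using Python's standard `heapq` module. `IncreaseMin()` and `UpdateRoot()` are the names used on the slide; `heapq.heapreplace`, which swaps out the root and restores the heap property in one step, is used here as the closest standard-library analogue. The cost values are hypothetical.

```python
import heapq


def lpt_heap(costs, nprocs):
    """LPT using a binary min-heap keyed on each process's current load."""
    # One (load, process) entry per process, all starting at zero cost.
    heap = [(0.0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    owner = {}
    for task in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, proc = heap[0]  # the least-loaded process sits at the root
        owner[task] = proc
        # Increase the minimum: replace the root and sift it down, O(log N).
        heapq.heapreplace(heap, (load + costs[task], proc))
    loads = [0.0] * nprocs
    for load, proc in heap:
        loads[proc] = load
    return owner, loads


owner, loads = lpt_heap([7, 5, 4, 3, 3, 2], 2)
print(owner, loads)
```

Each assignment costs at most O(log N) heap work instead of an O(N) scan over all processes, which matters when N is in the thousands.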


LPT - Load Balance

(a) Original with Nxtval Measured

(b) Inspector/Executor with Nxtval Measured

(c) LPT – 1st iteration

(d) LPT – subsequent iterations


Dynamic Buckets Design


Dynamic Buckets Implementation


Dynamic Buckets Load Balance

(a) LPT Predicted

(b) LPT Measured

(c) Dynamic Buckets Predicted

(d) Dynamic Buckets Measured

I/E Results

Nitrogen - CCSDT

Benzene - CCSD

10-H2O Cluster Results (DB)

CCSD_t2_7_3

CCSD_t2_7


Conclusions

1. Nxtval can be expensive at large scales

2. Static Partitioning can fix the problem, but has weaknesses:
• Requires performance model
• Noise degrades results

3. Dynamic Buckets is a viable alternative, and requires few changes to GA applications.

4. Solving load balance issues differs from problem to problem – work needs to be done to pinpoint why and what to do about it.


Future Work (Research)

1. Cyclops Tensor Framework (CTF)

2. DAG Scheduling of tensor contractions

3. What happens with accelerators (MIC/GPU)?
   1. Performance model
   2. Balancing load across both CPU and device

4. Comparison with hierarchical distributed load balancing, work stealing, etc.

5. Hypergraph partitioning / data locality