WorkQ: A Many-Core Producer/Consumer Execution Model
Applied to PGAS Computations
David Ozog*, Allen Malony*, Jeff Hammond‡, Pavan Balaji†
* University of Oregon
‡ Intel Corporation
† Argonne National Laboratory
ICPADS 2014, Hsinchu, Taiwan
December 18, 2014
Motivation
• Effectively dealing with irregularity in highly parallel applications is difficult.
• Sparsity and task variation are inherent to many computational problems.
• Load balancing is important, and must be done in a way that preserves effective overlap of communication and computation.
• Simply using non-blocking communication calls is not enough for collections of highly irregular tasks.
Motivation
• Overlap each execute_task() with the next get_task().
[Figure: program trace over time, showing execute() and get() phases]
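The benefit of overlapping each execute_task() with the next get_task() can be estimated with a small timing model. The sketch below is illustrative only. The function names and the per-task times are hypothetical, and it assumes a simple double-buffering scheme in which exactly one get may be in flight while one task executes.

```python
def serial_time(get_times, exec_times):
    """Total time when every get_task() blocks before its execute_task()."""
    return sum(get_times) + sum(exec_times)


def overlapped_time(get_times, exec_times):
    """Total time when the get for task i+1 is issued as task i starts executing.

    Assumes one outstanding prefetch (double buffering). Each step costs
    max(exec_i, get_{i+1}), because the next task cannot start until its
    data has arrived and the current task has finished.
    """
    if not get_times:
        return 0.0
    total = get_times[0]  # the first get cannot be hidden behind any work
    for i, exec_t in enumerate(exec_times):
        next_get = get_times[i + 1] if i + 1 < len(get_times) else 0.0
        total += max(exec_t, next_get)
    return total


# Hypothetical irregular workload, in seconds.
gets = [1.0, 4.0, 1.0, 4.0]
execs = [1.0, 1.0, 5.0, 1.0]

print(serial_time(gets, execs))      # 18.0
print(overlapped_time(gets, execs))  # 12.0
```

In this made-up workload the prefetch hides only part of the communication: the 4 s get issued alongside the first 1 s task still stalls the pipeline for 3 s. This is the kind of loss the slides attribute to irregular task collections.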
NWChem and Coupled Cluster
Coupled Cluster (CC):
• Ab initio - Highly accurate
• Accuracy hierarchy: CCSD < CCSD(T) < CCSDT < CCSDTQ
• Computational/Memory scaling (in the order of the hierarchy):
  • Memory: $O(n^4)$, $O(n^4)$, $O(n^6)$, $O(n^6)$
  • Computational: $O(n^6)$, $O(n^7)$, $O(n^8)$, $O(n^9)$
Sparsity and Load Imbalance
[Figure: sparse products of the form operand * operand = result, with a TILING step]
• Load balance is crucially important for performance
• Obtaining optimal load balance is NP-hard
Previous Work:
Inspector/Executor Load Balancing
I/E Static Partitioning Design
1. Inspector
• Calculate memory requirements
• Detect null tasks
• Collate task list
2. Task Cost Estimator
• Two options:
  • Use performance models
  • Timers from previous iteration(s)
3. Static Partitioner
• Partition into N groups, where N is the number of MPI processes
• Minimize load imbalance according to cost estimations
• Write task list information for each process to volatile memory
4. Executor
• Launch all tasks
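The static partitioning step can be illustrated with a standard heuristic. The sketch below uses greedy longest-processing-time-first (LPT) assignment, which is a swapped-in technique chosen for brevity and not necessarily the partitioner used in the I/E work. The function name `partition_tasks` and the cost values are hypothetical.

```python
import heapq


def partition_tasks(costs, n_groups):
    """Assign task indices to n_groups bins using the greedy LPT heuristic.

    Tasks are visited from most to least expensive, and each one goes to the
    bin with the smallest accumulated cost so far.
    """
    if n_groups < 1:
        raise ValueError("n_groups must be at least 1")
    # Heap entries are (accumulated_cost, group_id), so the lightest bin pops first.
    heap = [(0.0, g) for g in range(n_groups)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_groups)]
    order = sorted(range(len(costs)), key=lambda t: costs[t], reverse=True)
    for task in order:
        load, g = heapq.heappop(heap)
        groups[g].append(task)
        heapq.heappush(heap, (load + costs[task], g))
    loads = [sum(costs[t] for t in grp) for grp in groups]
    return groups, loads


# Hypothetical estimated costs for seven non-null tasks, split over 3 processes.
estimated_costs = [7.0, 5.0, 4.0, 3.0, 3.0, 2.0, 1.0]
groups, loads = partition_tasks(estimated_costs, 3)
print(groups)  # one list of task indices per process
print(loads)   # estimated total cost per process
```

Because finding the optimal partition is NP-hard, a heuristic like this trades optimality for speed. Its quality also depends on how accurate the cost estimates from step 2 are.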
I/E Results
[Figure: result panels for Nitrogen, Benzene, and 10 water molecules]
Design and Implementation:
WorkQ Execution Model
Original Execution
[Figure: original execution, shown across slides 13-22]
WorkQ Execution
[Figure: WorkQ execution, shown across slides 23-48]
WorkQ Algorithm
WorkQ Library API
Courier:
• workq_create_queue()
• workq_alloc_task()
• workq_append_task()
• workq_enqueue()
Worker:
• workq_dequeue()
• workq_get_next()
• workq_execute_task()
Finalization:
• workq_free_shm()
• workq_destroy()
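The courier/worker split above follows the producer/consumer pattern. As a rough illustration of that pattern only, the sketch below uses Python threads and the standard `queue.Queue`. It does not call the WorkQ library, whose signatures are not shown in the slides, and the names `courier`, `worker`, `run`, `fetch`, `compute`, and `max_queued` are hypothetical.

```python
import queue
import threading

SENTINEL = None  # marks the end of the task stream


def courier(task_ids, fetch, work_queue):
    """Producer: fetch each task's input data and enqueue it for the worker."""
    for task_id in task_ids:
        work_queue.put((task_id, fetch(task_id)))
    work_queue.put(SENTINEL)


def worker(work_queue, compute, results):
    """Consumer: dequeue tasks until the sentinel arrives and execute them."""
    while True:
        item = work_queue.get()
        if item is SENTINEL:
            break
        task_id, data = item
        results[task_id] = compute(data)


def run(task_ids, fetch, compute, max_queued=4):
    # A bounded queue keeps the courier from running arbitrarily far ahead.
    work_queue = queue.Queue(maxsize=max_queued)
    results = {}
    consumer = threading.Thread(target=worker, args=(work_queue, compute, results))
    consumer.start()
    courier(task_ids, fetch, work_queue)  # the calling thread plays the courier
    consumer.join()
    return results


# Stand-ins for a remote get (fetch) and a local compute kernel (compute).
results = run(range(5), fetch=lambda i: [i] * 3, compute=sum)
print(results)
```

The queue decouples the two roles: the courier can keep fetching while the worker computes, so a slow get delays the worker only when the queue runs empty.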
WorkQ Results
[Figure: execution timelines (time axis) for Original Execution and WorkQ Execution]
Legend: ARMCI_Rmw() (NxtVal), ARMCI_GetS() (Get), Misc. Computation (memory-intensive), DGEMM (flop-intensive)
Time: Original Execution 22.1 s, WorkQ Execution 11.5 s
WorkQ Results
(ACISS Cluster at UOregon):
WorkQ Mini-app Weak Scaling
ACISS Cluster (UOregon):
• 2x Intel X5650
• 2.67 GHz 6-core CPUs
• 12 cores per node
• 72 GB RAM per node
• Ethernet interconnect
WorkQ Mini-app Weak Scaling
Blues Cluster (Argonne):
• 2x Intel X5550
• 2.67 GHz 4-core CPUs
• 8 cores per node
• 24 GB RAM per node
• InfiniBand QDR interconnect
WorkQ w/ NWChem
Experiment Configuration:
• ACISS cluster
• 3 water molecules
• aug-cc-pVDZ basis
Top:
• 384 MPI processes
Bottom:
• 192 MPI processes
WorkQ w/ NWChem
Experiment Configuration:
• Carver cluster
• InfiniBand network
• 5 water molecules
• aug-cc-pVDZ basis
• 384 MPI processes
Conclusions
• The get/compute/put model suffers from unnecessary wait times.
• Using non-blocking communication may not achieve optimal overlap with irregular workloads.
• Opportunities exist for exploiting communication/computation overlap via a more dynamic and adaptive runtime execution model.
• Future work will involve integration of I/E load balancing and WorkQ optimizations, runtime parameter auto-tuning, and exploration on heterogeneous systems.
References
• M. Valiev, E.J. Bylaska, N. Govind, K. Kowalski, T.P. Straatsma, H.J.J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T.L. Windus, W.A. de Jong, “NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations”, Computer Physics Communications, Volume 181, Issue 9, September 2010, 1477–89.
• So Hirata, “Tensor Contraction Engine: Abstraction and Automated Parallel Implementation of Configuration-Interaction, Coupled-Cluster, and Many-Body Perturbation Theories”, The Journal of Physical Chemistry A 2003 107 (46), 9887-9897.
• J. Nieplocha, R.J. Harrison, and R.J. Littlefield. Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers. The Journal of Supercomputing, 10(2):169–189, 1996.
• Jarek Nieplocha and Bryan Carpenter. ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems. In Parallel and Distributed Processing, volume 1586 of Lecture Notes in Computer Science, pages 533–546. Springer Berlin Heidelberg, 1999.
• David Ozog, Jeff Hammond, James Dinan, Pavan Balaji, Sameer Shende, Allen Malony: Inspector-Executor Load Balancing Algorithms for Block-Sparse Tensor Contractions. ICPP 2013.