Parallel FMM
Matthew Knepley
Computation Institute, University of Chicago
Department of Molecular Biology and Physiology, Rush University Medical Center
Conference on High Performance Scientific Computing
In Honor of Ahmed Sameh’s 70th Birthday
Purdue University, October 11, 2010
M. Knepley (UC) SC Sameh ’10 1 / 1
Main Point
Using estimates and proofs,
a simple software architecture
gets good scaling, efficiency, and adaptive load balance.
Collaborators
The PetFMM team:
Prof. Lorena Barba
  Dept. of Mechanical Engineering, Boston University
Dr. Felipe Cruz, developer of GPU extension
  Nagasaki Advanced Computing Center, Nagasaki University
Dr. Rio Yokota, developer of 3D extension
  Dept. of Mechanical Engineering, Boston University
Collaborators
Chicago Automated Scientific Computing Group:
Prof. Ridgway Scott
  Dept. of Computer Science, University of Chicago
  Dept. of Mathematics, University of Chicago
Peter Brune (biological DFT)
  Dept. of Computer Science, University of Chicago
Dr. Andy Terrel (Rheagen)
  Dept. of Computer Science and TACC, University of Texas at Austin
Complementary Work
FMM Work
Queue-based hybrid execution
  OpenMP for multicore processors
  CUDA for GPUs
Adaptive hybrid Treecode-FMM
  Treecode competitive only for very low accuracy
  Very high flop rates for treecode M2P operation
Computation/Communication Overlap FMM
  Provably scalable formulation
  Overlap P2P with M2L
Short Introduction to FMM
FMM Applications
FMM can accelerate both integral and boundary element methods for:
  Laplace
  Stokes
  Elasticity
Advantages
  Mesh-free
  O(N) time
  Distributed and multicore (GPU) parallelism
  Small memory bandwidth requirement
Fast Multipole Method
FMM accelerates the calculation of the function:
$$\Phi(x_i) = \sum_j K(x_i, x_j)\, q(x_j) \qquad (1)$$
Accelerates O(N^2) to O(N) time
The kernel K(x_i, x_j) must decay quickly away from the diagonal (x_i, x_i)
Can be singular on the diagonal (Calderón-Zygmund operator)
Discovered by Leslie Greengard and Vladimir Rokhlin in 1987
Very similar to recent wavelet techniques
For the kernel K(x_i, x_j) = 1/|x_i - x_j| with charges q_j = q(x_j), equation (1) becomes:
$$\Phi(x_i) = \sum_j \frac{q_j}{|x_i - x_j|}$$
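As a point of reference, the O(N^2) sum that FMM replaces can be written directly. This is a minimal sketch, not code from PetFMM: the function name `direct_potential` is hypothetical, and skipping the singular self-term j = i is an assumption that the slide does not state.

```python
import math

def direct_potential(points, charges):
    """Direct O(N^2) evaluation of Phi(x_i) = sum_j q_j / |x_i - x_j|.

    Assumption: the singular self-interaction (j == i) is skipped.
    """
    phi = []
    for i, xi in enumerate(points):
        total = 0.0
        for j, xj in enumerate(points):
            if i == j:
                continue  # skip the singular diagonal term
            total += charges[j] / math.dist(xi, xj)
        phi.append(total)
    return phi

# Two charges a distance 2 apart: each sees only the other one.
print(direct_potential([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], [1.0, 3.0]))
```

The doubly nested loop is what makes the cost quadratic; FMM keeps this direct form only for nearby boxes (P2P) and replaces the rest with expansions.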
Spatial Decomposition
Pairs of boxes are divided into near and far:
Neighbors are treated as very near.
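The near/far split can be sketched for a uniform quadtree. The code below assumes the standard textbook definitions, which the slide does not spell out: the neighbors of a box are the adjacent boxes on the same level, and the far boxes handled at that level (the interaction list) are the children of the parent's neighbors that are not themselves neighbors. The function names are hypothetical.

```python
def neighbors(level, ix, iy):
    """Same-level boxes adjacent to box (ix, iy) in a uniform quadtree."""
    n = 1 << level  # boxes per side at this level
    return {
        (ix + dx, iy + dy)
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0) and 0 <= ix + dx < n and 0 <= iy + dy < n
    }

def interaction_list(level, ix, iy):
    """Children of the parent's neighbors that are not near box (ix, iy)."""
    if level == 0:
        return set()
    near = neighbors(level, ix, iy) | {(ix, iy)}
    far = set()
    for px, py in neighbors(level - 1, ix // 2, iy // 2):
        for cx in (2 * px, 2 * px + 1):
            for cy in (2 * py, 2 * py + 1):
                far.add((cx, cy))
    return far - near

# An interior box has 8 neighbors and 27 interaction-list boxes in 2D.
print(len(neighbors(3, 3, 3)), len(interaction_list(3, 3, 3)))
```

Under these definitions every pair of boxes is either near (handled directly), in the interaction list (handled at this level), or already covered at a coarser level.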
Functional Decomposition
Upward Sweep: create multipole expansions (P2M, M2M).
Downward Sweep: evaluate local expansions (M2L, L2L, L2P).
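The first step of the upward sweep, P2M, can be illustrated with the classical 2D expansion for the logarithmic kernel. This is a sketch under stated assumptions: it uses the 2D Laplace kernel log|z - z_j| in complex notation rather than the 1/|x_i - x_j| kernel shown earlier, the function names are hypothetical, and it is not the PetFMM implementation.

```python
import cmath
import math

def p2m(sources, charges, center, p):
    """Particle-to-multipole: coefficients of a p-term expansion about center.

    Uses a_k = -sum_j q_j (z_j - center)^k / k for k = 1..p, plus total charge Q.
    """
    Q = sum(charges)
    a = [
        -sum(q * (z - center) ** k for z, q in zip(sources, charges)) / k
        for k in range(1, p + 1)
    ]
    return Q, a

def evaluate_multipole(Q, a, center, z):
    """Far-field potential Re[Q log(z - c) + sum_k a_k / (z - c)^k]."""
    w = z - center
    value = Q * cmath.log(w)
    for k, ak in enumerate(a, start=1):
        value += ak / w ** k
    return value.real

def direct(sources, charges, z):
    """Reference potential sum_j q_j log|z - z_j|."""
    return sum(q * math.log(abs(z - zj)) for zj, q in zip(sources, charges))

sources = [0.1 + 0.2j, -0.3 + 0.1j, 0.25 - 0.4j]
charges = [1.0, -2.0, 0.5]
Q, a = p2m(sources, charges, center=0j, p=12)
target = 3 + 4j  # well separated from the source cluster
print(abs(evaluate_multipole(Q, a, 0j, target) - direct(sources, charges, target)))
```

The expansion is accurate only for targets well separated from the cluster, which is why near pairs are still evaluated directly.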
Parallelism
FMM in Sieve
The Quadtree is a Sieve with optimized operations
Multipoles are stored in Sections
Two Overlaps are defined
  Neighbors
  Interaction List
Completion moves data for
  Neighbors
  Interaction List
FMM Control Flow
Kernel operations will map to GPU tasks.
FMM Control Flow: Parallel Operation
[Figure: the tree is cut at level k into a root tree and sub-trees 1 to 8, each forming a local domain; the legend marks M2M and L2L translations and M2L transformations.]
Parallel Tree Implementation
Divide tree into a root and local trees
Distribute local trees among processes
Provide communication pattern for local sections (overlap)
  Both neighbor and interaction list overlaps
Sieve generates MPI from high-level description
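The split into a root tree and local trees can be sketched for a uniform quadtree cut at a level k. The helper names and the row-major numbering of subtrees are assumptions for illustration; the actual Sieve data structures are not shown in the slides.

```python
def subtree_of(level, ix, iy, k):
    """Coordinates, at cut level k, of the local tree containing box (ix, iy).

    Boxes with level < k belong to the root tree and have no local tree.
    """
    if level < k:
        raise ValueError("box lies in the root tree")
    shift = level - k  # each level halves the box size
    return ix >> shift, iy >> shift

def subtree_id(sx, sy, k):
    """Row-major index of subtree (sx, sy) among the 4**k local trees."""
    return sy * (1 << k) + sx

# Leaf (13, 6) on level 4, tree cut at level 2: it lies in subtree (3, 1).
sx, sy = subtree_of(4, 13, 6, 2)
print((sx, sy), subtree_id(sx, sy, 2))
```

Each local tree can then be handed to a process as a unit, while the small root tree above level k is what ties the local trees together.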
Parallel Tree Implementation: How should we distribute trees?
Multiple local trees per process allow good load balance
Partition weighted graph
  Minimize load imbalance and communication
  Computation estimate:
    Leaf: $N_i p$ (P2M) + $n_I p^2$ (M2L) + $N_i p$ (L2P) + $3^d N_i^2$ (P2P)
    Interior: $n_c p^2$ (M2M) + $n_I p^2$ (M2L) + $n_c p^2$ (L2L)
  Communication estimate:
    Diagonal: $n_c (L - k - 1)$
    Lateral: $2^d \, \frac{2^{m(L-k-1)} - 1}{2^m - 1}$ for incidence dimension $m$
Leverage existing work on graph partitioning
  ParMetis
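The estimates above translate directly into small functions that could supply vertex and edge weights for the partitioner. The reading of the symbols is an assumption, since the slide does not define them: N_i is the number of particles in leaf i, p the number of expansion terms, n_I the interaction list size, n_c the number of children per box, d the spatial dimension, L the tree depth, k the cut level, and m the incidence dimension.

```python
def leaf_work(N_i, p, n_I, d):
    """P2M + M2L + L2P + P2P cost estimate for one leaf box."""
    return N_i * p + n_I * p**2 + N_i * p + 3**d * N_i**2

def interior_work(n_c, n_I, p):
    """M2M + M2L + L2L cost estimate for one interior box."""
    return n_c * p**2 + n_I * p**2 + n_c * p**2

def diagonal_comm(n_c, L, k):
    """Communication estimate between diagonally adjacent local trees."""
    return n_c * (L - k - 1)

def lateral_comm(d, m, L, k):
    """Communication estimate between laterally adjacent local trees.

    The fraction is the geometric sum 1 + 2^m + ... + 2^(m(L-k-2)),
    so integer division is exact.
    """
    return 2**d * (2 ** (m * (L - k - 1)) - 1) // (2**m - 1)

# Example in 2D: p = 4 terms, 27 interaction-list boxes, 10 particles per leaf.
print(leaf_work(10, 4, 27, 2), interior_work(4, 27, 4))
print(diagonal_comm(4, 5, 2), lateral_comm(2, 1, 5, 2))
```

In the example the P2P term (900 of 1412) dominates the leaf cost, which shows why the number of particles per leaf matters for load balance.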
Parallel Tree Implementation: Why should a good partition exist?
Shang-hua Teng, Provably good partitioning and load balancing algorithms for parallel adaptive N-body simulation, SIAM J. Sci. Comput., 19(2), 1998.
Good partitions exist for non-uniform distributions
  2D: $O\left(\sqrt{n} (\log n)^{3/2}\right)$ edgecut
  3D: $O\left(n^{2/3} (\log n)^{4/3}\right)$ edgecut
As scalable as regular grids
As efficient as uniform distributions
ParMetis will find a nearly optimal partition
Parallel Tree Implementation: Will ParMetis find it?
George Karypis and Vipin Kumar, Analysis of Multilevel Graph Partitioning, Supercomputing, 1995.
Good partitions exist for non-uniform distributions
  2D: $C_i = 1.24^i C_0$ for random matching
  3D: $C_i = 1.21^i C_0$ ?? for random matching
3D proof needs assurance that average degree does not increase
Efficient in practice
Parallel Tree Implementation: Advantages
Simplicity
Complete serial code reuse
Provably good performance and scalability
Distributing Local Trees
The interaction of local trees is represented by a weighted graph.
[Figure: two vertices with weights $w_i$ and $w_j$, joined by an edge with weight $c_{ij}$.]
This graph is partitioned, and trees are assigned to processes.
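To make the weighted graph concrete, the sketch below assigns local trees to processes and measures the result. It is not ParMetis: it uses a simple greedy heuristic (heaviest tree first, onto the least loaded process) that balances the vertex weights w_i only, and it reports the edge cut from the weights c_ij afterwards. All names and the example weights are hypothetical.

```python
def greedy_partition(weights, num_procs):
    """Assign each tree to the currently least loaded process, heaviest first."""
    loads = [0.0] * num_procs
    part = {}
    for v in sorted(weights, key=weights.get, reverse=True):
        proc = min(range(num_procs), key=loads.__getitem__)
        part[v] = proc
        loads[proc] += weights[v]
    return part, loads

def edge_cut(edges, part):
    """Total weight c_ij of edges whose endpoints lie on different processes."""
    return sum(c for (i, j), c in edges.items() if part[i] != part[j])

# Four local trees with work estimates w_i and communication estimates c_ij.
weights = {0: 4.0, 1: 3.0, 2: 2.0, 3: 1.0}
edges = {(0, 1): 2.0, (1, 2): 5.0, (2, 3): 1.0, (0, 3): 3.0}
part, loads = greedy_partition(weights, num_procs=2)
print(part, loads, edge_cut(edges, part))
```

A real partitioner minimizes load imbalance and edge cut together, which is the reason the talk relies on ParMetis rather than a heuristic of this kind.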