
Page 1

Classical Algebraic Multigrid for CFD using CUDA

Simon Layton, Lorena Barba

With thanks to Justin Luitjens, Jonathan Cohen (NVIDIA)

Page 2

Problem Statement

‣Solve elliptic problems of the form $Au = b$

‣For instance, the pressure Poisson equation from fluid dynamics: $\nabla^2 \phi = \nabla \cdot \mathbf{u}^*$

- This commonly takes 90%+ of the total run time

‣These systems arise in many other engineering applications
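As a concrete example of the kind of system being targeted, here is a minimal sketch (not from the slides) that assembles a 2-D 5-point Poisson matrix with SciPy; the helper name poisson_5pt and the grid size are just for illustration, and the right-hand side stands in for the discrete divergence.

# Illustrative sketch: assemble a 5-point Poisson system A u = b on an n x n grid.
import numpy as np
import scipy.sparse as sp

def poisson_5pt(n):
    # 1-D second-difference operator, combined via Kronecker products
    T = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n))
    I = sp.identity(n)
    return (sp.kron(I, T) + sp.kron(T, I)).tocsr()

A = poisson_5pt(64)              # 4096 x 4096 sparse matrix
b = np.random.rand(A.shape[0])   # stand-in right-hand side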

Page 3

What is Multigrid?

‣Hierarchical

- Repeatedly reduce the size of the problem

- Solve these smaller problems & transfer corrections

‣ Iterative

- Based on a smoother (e.g. Jacobi, Gauss-Seidel)

- Solve using recursive ‘cycles’


Page 4

Why do we care?

‣Optimal complexity

- Convergence independent of the problem size

- Consequently O(N) in time

‣Parallel & scalable

- Hypre package from LLNL scales to 100k+ cores

- “Scaling Hypre’s multigrid solvers to 100,000 cores”, A. H. Baker et al.

Page 5

Classical vs. Aggregative AMG

‣Solve phase is the same; the only difference is in the setup

- Classical selects variables to be in the coarser level

- Aggregation combines variables to form the coarser level


Classical: choose variables to keep on the coarse level

Aggregation: combine variables to form an entry on the coarse level

Page 6

AMG - Main Components

‣Consists of 5 main parts:

1. Strength of Connection metric (SoC) - a measure of how strongly matrix variables affect each other

2. Selector / Aggregator - chooses variables / aggregates to move to the coarser level

3. Interpolator / Restrictor - generates coarsened matrices & transfers residuals between levels

4. Galerkin product - generates the next coarsest level: $A_{k+1} = R_k A_k P_k$ (see the sketch after this list)

5. Solver cycle - consists of sparse matrix-vector multiplications
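As flagged in item 4, the Galerkin triple product can be written directly with sparse matrix products. A minimal SciPy sketch, assuming R_k, A_k, P_k are stored as CSR matrices (the helper name galerkin_product is illustrative; the GPU code later replaces this generic product with the custom hash-based SpMM):

# Illustrative: form the coarse-level operator A_{k+1} = R_k * A_k * P_k.
import scipy.sparse as sp

def galerkin_product(R, A, P):
    # two sparse products; keep CSR so the result is usable on the next level
    return (R @ A @ P).tocsr()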


Page 7

Strength of Connection

‣Variable i strongly depends on variable j if:

$-a_{ij} \geq \alpha \max_{k \neq i}(-a_{ik})$

‣Once the max value has been calculated, the strength calculation for each edge can be performed in parallel

[Figure: example graph with entries 4, -1, -0.1, -0.1, -0.3, -0.5, -0.8, -1]

In this case $\alpha = 0.25$, so the threshold is $-0.25$
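A serial Python sketch of this strength-of-connection test, assuming A is a SciPy CSR matrix (illustrative only, with a hypothetical helper name strong_connections; the CUDA kernel evaluates the same per-edge test in parallel once each row's maximum is known):

# Mark strong connections row by row: -a_ij >= alpha * max_{k != i}(-a_ik).
import numpy as np
import scipy.sparse as sp

def strong_connections(A, alpha=0.25):
    A = A.tocsr()
    strong = np.zeros_like(A.data, dtype=bool)
    for i in range(A.shape[0]):
        start, end = A.indptr[i], A.indptr[i + 1]
        cols = A.indices[start:end]
        vals = A.data[start:end]
        off = cols != i                         # exclude the diagonal
        if not off.any():
            continue
        threshold = alpha * np.max(-vals[off])  # alpha * max_k(-a_ik)
        strong[start:end] = off & (-vals >= threshold)
    return sp.csr_matrix((strong, A.indices, A.indptr), shape=A.shape)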

Page 8

PMIS Selector

‣Select some variables to be coarse

- Use strongest connection

- Neighbours of these variables are now fine

‣Repeat until all variables are classified

[Figure: initial graph]

Page 9

PMIS Selector

[Figure: some variables selected to be coarse]


Page 10

PMIS Selector

[Figure: neighbours of these coarse variables marked as fine]

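The selection loop sketched across these slides can be written serially as below: a Python approximation of PMIS, assuming a symmetric boolean strength matrix in CSR form. The function name pmis_select, the state encoding, and the weight choice follow the usual PMIS description rather than the exact CUDA kernels.

# Serial PMIS-style selection: coarse points are local maxima of a random
# weight among still-unassigned strong neighbours; their neighbours become fine.
import numpy as np

UNASSIGNED, COARSE, FINE = 0, 1, 2

def pmis_select(strong, seed=0):
    rng = np.random.default_rng(seed)
    n = strong.shape[0]
    state = np.full(n, UNASSIGNED)
    # number of strong connections plus a random tie-breaker
    weight = strong.getnnz(axis=1) + rng.random(n)
    while (state == UNASSIGNED).any():
        unassigned = np.flatnonzero(state == UNASSIGNED)
        new_coarse = []
        for i in unassigned:
            nbrs = strong.indices[strong.indptr[i]:strong.indptr[i + 1]]
            nbrs = nbrs[state[nbrs] == UNASSIGNED]
            if (weight[i] > weight[nbrs]).all():   # locally maximal
                new_coarse.append(i)
        state[new_coarse] = COARSE
        for i in new_coarse:
            nbrs = strong.indices[strong.indptr[i]:strong.indptr[i + 1]]
            state[nbrs[state[nbrs] == UNASSIGNED]] = FINE
    return state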

Page 11

Interpolator

‣Extended+i interpolator used

- Needed to ensure good convergence with our selector

- Considers neighbours of neighbours for interpolation set

[Figure: distance-1 neighbourhood]

Page 12

Interpolator


[Figure: distance-2 neighbourhood]

Page 13

Solve Cycle

‣Described recursively

- Simplest is V-cycle

if level k = M (coarsest):
    solve A_M u_M = f_M
else:
    Apply smoother n1 times to A_k u_k = f_k
    Perform coarse-grid correction:
        Set r_k = f_k - A_k u_k
        Set r_{k+1} = R_k r_k
        Call the cycle recursively for the next level
        Interpolate e_k = P_k e_{k+1}
        Correct the solution: u_k = u_k + e_k
    Apply smoother n2 times to A_k u_k = f_k

[Diagram: V-cycle from k = 1 down to k = M, restriction on the way down, interpolation on the way up]
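A compact runnable version of this cycle, as a Python/NumPy sketch with a weighted-Jacobi smoother; it assumes each level stores (A_k, R_k, P_k) as SciPy sparse matrices, and the function names and parameters are illustrative (the GPU solver uses cusp/cusparse SpMV and a parallel smoother instead).

# Recursive V-cycle sketch: pre-smooth, restrict, recurse, correct, post-smooth.
import numpy as np

def jacobi(A, u, f, n_sweeps, omega=0.8):
    Dinv = 1.0 / A.diagonal()
    for _ in range(n_sweeps):
        u = u + omega * Dinv * (f - A @ u)
    return u

def v_cycle(levels, k, u, f, n1=2, n2=2):
    # levels[k] = (A_k, R_k, P_k); the coarsest level has R = P = None
    A, R, P = levels[k]
    if R is None:                          # coarsest level: solve directly
        return np.linalg.solve(A.toarray(), f)
    u = jacobi(A, u, f, n1)                # pre-smooth
    r_coarse = R @ (f - A @ u)             # restrict the residual
    e_coarse = v_cycle(levels, k + 1,      # recurse on the coarse level
                       np.zeros(r_coarse.shape[0]), r_coarse, n1, n2)
    u = u + P @ e_coarse                   # interpolate & correct
    return jacobi(A, u, f, n2)             # post-smooth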

Page 14

GPU Implementation - Initial Thoughts

‣Most operations seem to easily expose parallelism:

- Strength of connection

- PMIS selector

- Solve cycle - SpMV from cusp / cusparse

‣Others less obvious or trickier:

- Galerkin Product

- Interpolator

‣Want to ensure correctness

- Compare against identical algorithms in Hypre

Page 15

Initial Results - Convergence


‣Regular Poisson grids

- Demonstrates problem size invariance for convergence rate

[Plot: average convergence rate (0 to 1) vs. N (up to 10^6) for 5-pt, 7-pt, 9-pt, and 27-pt stencils]

The 27-pt stencil runs out of memory on the C2050; we expect the same behaviour as for the other stencils

Page 16

Initial Results - Immersed Boundary Method

‣Speed compared to Hypre running on 1 node (unoptimized)

[Figure: immersed boundary test geometry]

Code         Time    Speedup
CUDA         3.57s   1.87x
Hypre (1C)   6.69s   -
Hypre (2C)   5.29s   1.26x
Hypre (4C)   4.16s   1.61x
Hypre (6C)   4.00s   1.67x

Page 17

Where to go from here?

‣Breakdown of the 5-pt Poisson stencil run time by routine (easy case)

[Pie chart: run-time share by routine - Prolongate & correct, Restrict residual, Compute R, Compute A, Compute P, Smooth, Other; shares of 44%, 39%, 13%, 2%, 2%, <1%, <1%]

Compute A is the Galerkin product - the sparse triple product R*A*P

Interpolation matrix: can we form this as an existing pattern?

Page 18

Compute P - Interpolation Matrix

‣Interpolation weights depend on the distance-2 set:

$C_i = C_i^s \cup \bigcup_{j \in F_i^s} C_j^s$

‣This repeated set union turns out to be fairly tricky

- It can be formed as a specialized SpMM

Page 19

SpMM / Repeated set union

‣SpMM conceptually:

- For each output row: if the influencing column is already in the row, accumulate into the current value; otherwise add the column to the row

‣Set union conceptually:

- For each set of coarse connections: if the entry is already in the result set, ignore it; otherwise add it to the set

$C_i = C_i^s \cup \bigcup_{j \in F_i^s} C_j^s$

$C = (I + F^s) \cdot C^s$
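A small sketch of this identity with SciPy, assuming C^s and F^s are given as 0/1 sparse indicator matrices over the variables (the helper name distance2_coarse_sets is illustrative; on the GPU this product becomes the specialized SpMM described next):

# The repeated set union as the sparsity pattern of a sparse product.
# Cs[i, j] = 1 if j is a strong coarse connection of i,
# Fs[i, j] = 1 if j is a strong fine connection of i.
import scipy.sparse as sp

def distance2_coarse_sets(Cs, Fs):
    n = Cs.shape[0]
    C = (sp.identity(n, format="csr") + Fs.tocsr()) @ Cs.tocsr()
    C.data[:] = 1    # only the pattern (the union) matters
    return C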

Page 20

SpMM: Idea

‣A similar approach is covered in more detail later by Julien Demouth (S0285)

‣Implement the accumulator as a hash table in shared memory

- Work on as many rows per block as will fit in shared memory

For each row i in matrix A:
    Accumulator = []
    For each entry j in row i:
        rowB = column(j)
        For each entry k on row rowB of B:
            val = value(j) * value(k)
            Accumulator.add(column(k), val)
    Accumulator.write

[Diagram annotations: the loops are mapped to warp- and thread-level parallelism]
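A serial Python version of this accumulator pattern, using a dictionary where the GPU uses a shared-memory hash table; it is a sketch for C = A * B with CSR inputs (the name spmm_hash is illustrative), not the CUDA kernel itself.

# One hash-map accumulator per output row.
import scipy.sparse as sp

def spmm_hash(A, B):
    A, B = A.tocsr(), B.tocsr()
    indptr, indices, data = [0], [], []
    for i in range(A.shape[0]):
        acc = {}                                     # column -> accumulated value
        for jj in range(A.indptr[i], A.indptr[i + 1]):
            j, a_ij = A.indices[jj], A.data[jj]
            for kk in range(B.indptr[j], B.indptr[j + 1]):
                k = B.indices[kk]
                acc[k] = acc.get(k, 0.0) + a_ij * B.data[kk]
        indices.extend(acc.keys())
        data.extend(acc.values())
        indptr.append(len(indices))
    C = sp.csr_matrix((data, indices, indptr), shape=(A.shape[0], B.shape[1]))
    C.sort_indices()
    return C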

Page 21

Problem!

‣Performance is highly dependent on the size of the hash tables

- Not enough shared memory: the problem cannot be solved

- Too many values in shared memory: many collisions and poor performance

‣Answer? Estimate the storage needed

- Allocate the number of warps per block & the size of shared memory based on this estimate

- If all else fails, pass the multiplication to cusp

‣Worst case: almost identical performance

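One way such a sizing step could look, purely as an illustration; the function name, shared-memory budget, slot size, and caps below are hypothetical, not the values used in the talk.

# Hypothetical heuristic: pick rows-per-block and hash-table slots per row
# from the estimated non-zeros per output row; return None to fall back to cusp.
def launch_params(est_nnz_per_row, smem_bytes=48 * 1024, bytes_per_slot=8,
                  fill_factor=0.75, max_rows_per_block=8):
    worst_nnz = max(est_nnz_per_row)
    slots_per_row = int(worst_nnz / fill_factor) + 1   # pad to limit collisions
    rows_that_fit = smem_bytes // (slots_per_row * bytes_per_slot)
    if rows_that_fit == 0:
        return None                                    # hash table too big for shared memory
    return min(rows_that_fit, max_rows_per_block), slots_per_row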

Page 22

Storage Estimation

‣From Edith Cohen: ‘Structure prediction and computation of sparse matrix products’ 1998

‣Represent as a bipartite graph (2x2 grid, 5-pt Poisson stencil)

‣By looking at the connections, get an estimate of the non-zeros in C

[Figure: bipartite graph with node layers labelled A.rows, A.cols/B.rows, and B.cols/C.rows, nodes 0-3]


Page 29

Storage Prediction - Implementation

For r in [0, R):
    For each row, i:
        u[i] = rand()  // exponential dist.

    For each row in A, i:
        v[i] = +infinity
        For each non-zero j on row i:
            v[i] = min(v[i], u[col[j]])

    For each row in B, i:
        w[i][r] = +infinity
        For each non-zero j on row i:
            w[i][r] = min(w[i][r], v[col[j]])

nnz[i] = ceil((R-1) / sum(w[i][:]))  // estimator
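A runnable serial sketch of this estimator in NumPy/SciPy, written for a generic product C = A * B; the function name estimate_row_nnz, the loop order, and the operand names are illustrative and may differ from the slides' version, which is arranged for their particular triple product.

# Cohen-style estimate of the non-zeros per row of C = A * B.
import numpy as np
import scipy.sparse as sp

def estimate_row_nnz(A, B, R=32, seed=0):
    A, B = A.tocsr(), B.tocsr()
    rng = np.random.default_rng(seed)
    sums = np.zeros(A.shape[0])
    for _ in range(R):
        u = rng.exponential(1.0, size=B.shape[1])   # one sample per column of B
        v = np.full(B.shape[0], np.inf)             # v[j] = min of u over row j of B
        for j in range(B.shape[0]):
            cols = B.indices[B.indptr[j]:B.indptr[j + 1]]
            if cols.size:
                v[j] = u[cols].min()
        w = np.full(A.shape[0], np.inf)             # w[i] = min of v over row i of A
        for i in range(A.shape[0]):
            cols = A.indices[A.indptr[i]:A.indptr[i + 1]]
            if cols.size:
                w[i] = v[cols].min()
        sums += w
    return np.ceil((R - 1) / sums)                  # estimated nnz of each row of C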

Page 30

Storage Prediction - Optimizations

For r in [0, R):
    For each row, i:
        u[i] = rand()  // exponential dist.

    For each row in A, i:
        v[i] = FLT_MAX
        For each non-zero j on row i:
            v[i] = min(v[i], hash(col[j]))

    For each row in B, i:
        w[i][r] = FLT_MAX
        For each non-zero j on row i:
            w[i][r] = min(w[i][r], v[col[j]])

nnz[i] = ceil((R-1) / sum(w[i][:]))  // estimator

Callouts:

- Use a pseudo-random hash

- Generalized SpMVs

- Only float precision needed

Page 31

Conclusions

‣Implemented the full classical AMG algorithm in parallel

- Validated against the open-source code Hypre

- Tested with a CFD problem

‣Attempted to solve the bottleneck of the Galerkin product & repeated set union

- Predict storage per row

- Use this to determine launch parameters for hash-based SpMM using shared memory
