gpu-accelerated cfd simulations for turbomachinery …...aissa m. , verstraete ,t., and vuik, c.....
TRANSCRIPT
GPU-accelerated CFD Simulationsfor Turbomachinery Design Optimization
Mohamed H. Aissa
Co-promotor:Dr. Tom Verstraete
Promotor: Prof. C. Vuik
www.researchgate.net/profile/Mohamed_Aissa3
Can your simulation profit from the GPU?
• What is a GPU?• How fast it is?• How to use it?
Multi-core vs many-core
Processor
Control Unit
Memory
Processor
Control Unit
Memory
Processor
Control Unit
Memory
Processor
Control Unit
Memory
Communication Netwok
Processor
Control Unit
Memory
Processor ProcessorProcessor
2/35
Massive Parallel Systems (e.g. GPU) as a trade-off
ALU
Control Unit
Memory
ALUALU
ALUALU
ALUALU
ALUALU
ALUALU
ALUALU
ALUALU
ALUALU
ALU ALU ALU
Source: the guardian.com
3
How fast is it?
Performance Gain
LU QR
SpMV Ray tracing
FFT Image processing
How to use a GPU
Ease of
Use
Performance Gain
GPU libraires:cuFFT, cuBLAS…
Lattice Boltzmann
4
Airplanes are gettingmore efficient
Engine optimization is a main contributor
Topic of Interest
• Nblades= 15• Chord length fixed
• Casing fixture
TurboLab Stator (1/4)
60 mm d=10mmh=20mm
d=2mm6
Inlet P0: 102713.0 PaInlet T0: 294.314 K
Inlet whirl angle: 42°Inlet pitch angle: 0 °
TurboLab (2/4): Boundary conditions and summary
Objectives:• Lower axial deviation• Lower total pressure loss
9 kg/s
7
TurboLab (3/4): Parametrization21 Design variables
8
TurboLab (4/4):Optimization Results
1.7 %
IT074IND6
60%
0.17%
9Every point is a costly CFD optimization need for a HPC solution
How beneficial are GPUs, a quick literature check:
• Acceleration is case-dependent (from 1x to 1000x).
• Speedups are sometimes contradicting.
• Some publications are very critical to GPUs for scientific computations:– Lee et al “Debunking the 100x GPU vs. CPU myth”
– Vuduc et al. “On the limits of GPU acceleration”
10/35
Main objective: A more tangible GPU potential
CFD GPU solvers Classification of CFD operations
Summaryand Conclusions
Proof-of-concept: Optimization cases
* All icon in this document from Flaticon.com
Main objective: A more tangible GPU potential
CFD GPU solvers Classification of CFD operations
Summaryand Conclusions
Proof-of-concept: Optimization cases
Numerical Scheme:
Implicit Time Stepping
Explicit Time Stepping
12
Main objective: A more tangible GPU potential
CFD GPU solvers Classification of CFD operations
Summaryand Conclusions
Proof-of-concept: Optimization cases
• Explicit time integration
Explicit solver
• Application:Steady RANS simulation
• Solved Equations:RANS (SA Model)
• Discretization (2nd Order):Roe Scheme + Flux LimiterExplicit RK 4 Stage
• Mesh: Multi-Block, Structured
• Acceleration: • 2 level Multigrid• Implicit Residual Smoothing
13
Explicit solver
162
90
0
20
40
60
80
100
120
140
160
180
0 100 200 300 400 500
Spe
ed
up
(o
ver
1 c
ore
Xeo
n E
3)
N Cells in thousands
GTX980
K40
13
Convective Flux Evaluation (1/3)
14
Convective Flux (2/3):Thread mapping possibilities
• Face-wise is not thread-safe
18 /47
• Cell-based mapping thread safe but with redundancy
14
Convective Flux (2/3): Thread mapping possibilities
• Direction-based mapping thread safe and less redundancy
14
Convective Flux (2/3): Thread mapping possibilities
• Multicoloring (MC) Face-based mappingthread safe and No redundancy
14
Convective Flux (2/3): Thread mapping possibilities
Compute Residual at
face i
Read striped
N/2 thread
Storestriped
i+1i
MC: Run 1
Compute Residual at
face i
Read coalesced
N thread
Storecoalesced
i
Red: Run 1
Compute Residual at
face i+1
N thread
Compute Residual at
face i+1
N/2 threadMC: Run 2
i+2i+1Read striped
Storestriped 0
50
100
150
200
250
1 2 3 4 5 6 7 8Mem
ory
Ban
dw
idth
[G
B/s
]
Stride length (in 4 bytes)
Memory bandwidth for striped memory access on GTX780:
A[i*stride]=B[i*stride]+C[i*stride]
15/35
Convective Flux (3/3): Multicolred (MC) vs redundant (Red)
MC RED
Face fluxes per call N/2 2N
Total faces fluxes N 2N
Time per call [ms] 0,28 0,71
Total time [ms] 0,56 0,71
Operations ratio - 2x
Total Speedup 1,26x -
1,26x instead of 2x:cost of striped access
16
Convective Flux (3/3): Multicolred (MC) vs redundant (Red)
Convergence Acceleration on GPU (1/3)
• Explicit solver is well adapted to the GPU architecture• Flow convergence is slow (CFL limitation)• Need for convergence acceleration.
• convergence acceleration methods on the GPU?• Multigrid• Implicit residual smoothing
17
Convergence Acceleration on GPU (2/3): Multigrid is also fast on the GPU
• Solve on fine grid• Interpolate solution and residual to coarse grid
CPU
GPU1,81
1,581,35
0
0,5
1
1,5
2
0 100 200 300 400 500
T 2xG
rid
s/T 1
Gri
d
N Cells in thousands
Cost of a 2-Grid schemeconverging to ideal cost of 1,125
• Solve on coase grid assisted by fine residual• Prolongate coarse correction to fine grid
1,125
18
Convergence Acceleration on GPU (3/3): Implicit Residual Smoothing on GPU
• Higher CFL Oscillation in the solution.
• A smoother residual reduces the oscillation ->Higher CFLs.
• Smoothing: diffusion equation solve a tridiagonal system.
i: 0 -> Ni
j: 0 -> N
j
• Higher CFL Oscillation in the solution.
• A smoother residual reduces the oscillation ->Higher CFLs.
• Smoothing: diffusion equation solve a tridiagonal system.
Interesting envelop(CFL increase with IRS x2-x3)
i: 0 -> Ni
j: 0 -> N
j
19
Convergence Acceleration on GPU (3/3): Implicit Residual Smoothing on GPU
Main objective: A more tangible GPU potential
CFD GPU solvers• Explicit time integration• Implicit time integration
Classification of CFD operations
Summaryand Conclusions
Proof-of-concept: Optimization cases
Implicit Time Stepping is more Stable but …
20
GMRES + Preconditioner
21/35
ILU is costly on GPU
• ILU-GMRES: Small gainon every iteration but ILU setup is slow:
• MCILU-GMRES: Multi-colored ILU fast only for small problems.
0
1
2
3
4
5
6
7
8
9
10
264k 1058k 2249k 4631k
Spee
du
p
Nrows
Speedup Krylov Speedup ILU
0
1
2
3
4
5
6
7
8
9
264k 1058k 2249k 4631kSp
eed
up
Nrows
Speedup Krylov Speedup MCILU
Why not Jacobi PC
0
20
40
60
80
100
120
140
160
180
264k 1058k 2249k 4631k
Spee
du
p
Nrows
Speedup Krylov Speedup Jacobi
• Jacobi-GMRES: very fastbut stable only for small time steps
• Jacobi-GMRES: Speedup decreases for higher CFLs
23
56023412 3586
1178
16568
579
16691
591
CPU ILU GPU ILU CPU OD-ILU GPU OD-ILU
Solve [s] Assemble[s]
On-demand factorization
5.55x11.46x
24
Main objective: A more tangible GPU potential
CFD GPU solvers Classification of CFD operations
Summaryand Conclusions
Proof-of-concept: Optimization cases
Speedup 1000x
Speedup 10x
Actually GPU is slower
Picture modified from :https://pixabay.com/en/blind-men-elephant-story-feel-see-1458438/
Classification (1/2):GPUs controversy
• GPU thousands of lightweight cores.
• Explicit solver: 10x to 100x speedup.
• Implicit solver: 1x to 10x speedup
We need a classification
25
Speedup 1000x
Speedup 10x
Actually GPU is slower
Picture modified from :https://pixabay.com/en/blind-men-elephant-story-feel-see-1458438/
25
Classification (1/2):GPUs controversy
38
Full article: Aissa M. , Verstraete ,T., and Vuik, c.. "Toward a GPU-aware comparison of explicit and implicit CFD simulations on structured meshes." Computers & Mathematics with Applications 74.1 (2017): 201-217.
Speedup 1000x
Speedup 10x
Actually GPU is slower
25
Classification (2/2):CFD operations
Performance Comparison: Explicit/Implicit
136x
7x
3x
Ref
GPU Explicit
GPU Implicit
CPU Implicit
CPU Explicit
26
Performance Comparison: Explicit/Implicit
0,001
0,01
0,1
1
10
100
1000
0,1 1 10 100 1000
Implicit CPU Explicit CPU implicit GPU explicit GPU
136x
Normalized wall time
27
Main objective: A more tangible GPU potential
CFD GPU solvers Classification of CFD operations
Summaryand Conclusions
Proof-of-concept: Optimization cases
Rc=14 (low CFL for implicit =15)
Example of a stator Optimization
0,001
0,01
0,1
1
10
100
1000
0,1 1 10 100 1000
Implicit CPU Explicit CPU
implicit GPU explicit GPU
28/35
Example of a stator Optimization
29
Example of a stator Optimization
LS82 cascade
Rc=457 (Explicit solver bad flow convergence)
0,001
0,01
0,1
1
10
100
1000
0,1 1 10 100 1000
Implicit CPU Explicit CPU
implicit GPU explicit GPU
31
LS82 cascade: Results
32
LS82 cascade: Optimized blade
Main objective: A more tangible GPU potential
CFD GPU solvers Classification of CFD operations
Summaryand Conclusions
Proof-of-concept: Optimization cases
Summary
Choice Explicit/Implicit: Convergence ratio is decisive.
Explicit RANS:100x-180x speedup.
Implicit RANS: 10x-20x speedup(due to slow preconditioning.
On-demand preconditioning:x3 faster but GPU-friendlierpreconditioner is needed.
The classification: an operation-specific acceleration offers more insights.
34
Can your simulation profit from the GPU?
• Where you situate your algorithm (slide 4: QR to ray-tracing)?• Do you need double precision (for half-precision FPGA is faster)?• ready to code (otherwise openACC is easier to use)?• Anyone provided a classification for operation used in your field?
35
Thanks for your attention
Dr. Mohamed H. Aissa
Turbomachinery & Propulsion Department
Email: [email protected]
www.researchgate.net/profile/Mohamed_Aissa3