gpu-accelerated cfd simulations for turbomachinery …...aissa m. , verstraete ,t., and vuik, c.....

GPU-accelerated CFD Simulationsfor Turbomachinery Design Optimization

Mohamed H. Aissa

Co-promotor:Dr. Tom Verstraete

Promotor: Prof. C. Vuik

www.researchgate.net/profile/Mohamed_Aissa3

https://repository.tudelft.nl/islandora/object/uuid%3A1fcc6ab4-daf5-416d-819a-2a7b0594c369

https://repository.tudelft.nl/islandora/object/uuid%3A1fcc6ab4-daf5-416d-819a-2a7b0594c369

http://www.researchgate.net/profile/Mohamed_Aissa3

Can your simulation profit from the GPU?

• What is a GPU?• How fast it is?• How to use it?

Multi-core vs many-core

Processor

Control Unit

Memory

Processor

Control Unit

Memory

Processor

Control Unit

Memory

Processor

Control Unit

Memory

Communication Netwok

Processor

Control Unit

Memory

Processor ProcessorProcessor

2/35

Massive Parallel Systems (e.g. GPU) as a trade-off

ALU

Control Unit

Memory

ALUALU

ALUALU

ALUALU

ALUALU

ALUALU

ALUALU

ALUALU

ALUALU

ALU ALU ALU

Source: the guardian.com

3

How fast is it?

Performance Gain

LU QR

SpMV Ray tracing

FFT Image processing

How to use a GPU

Ease of

Use

Performance Gain

GPU libraires:cuFFT, cuBLAS…

Lattice Boltzmann

4

Airplanes are gettingmore efficient

Engine optimization is a main contributor

Topic of Interest

• Nblades= 15• Chord length fixed

• Casing fixture

TurboLab Stator (1/4)

60 mm d=10mmh=20mm

d=2mm6

Inlet P0: 102713.0 PaInlet T0: 294.314 K

Inlet whirl angle: 42°Inlet pitch angle: 0 °

TurboLab (2/4): Boundary conditions and summary

Objectives:• Lower axial deviation• Lower total pressure loss

9 kg/s

7

TurboLab (3/4): Parametrization21 Design variables

8

TurboLab (4/4):Optimization Results

1.7 %

IT074IND6

60%

0.17%

9Every point is a costly CFD optimization need for a HPC solution

How beneficial are GPUs, a quick literature check:

• Acceleration is case-dependent (from 1x to 1000x).

• Speedups are sometimes contradicting.

• Some publications are very critical to GPUs for scientific computations:– Lee et al “Debunking the 100x GPU vs. CPU myth”

– Vuduc et al. “On the limits of GPU acceleration”

10/35

Main objective: A more tangible GPU potential

CFD GPU solvers Classification of CFD operations

Summaryand Conclusions

Proof-of-concept: Optimization cases

* All icon in this document from Flaticon.com

Numerical Scheme:

Implicit Time Stepping

Explicit Time Stepping

12





• Explicit time integration

Explicit solver

• Application:Steady RANS simulation

• Solved Equations:RANS (SA Model)

• Discretization (2nd Order):Roe Scheme + Flux LimiterExplicit RK 4 Stage

• Mesh: Multi-Block, Structured

• Acceleration: • 2 level Multigrid• Implicit Residual Smoothing

13

Explicit solver

162

90

0

20

40

60

80

100

120

140

160

180

0 100 200 300 400 500

Spe

ed

up

(o

ver

1 c

ore

Xeo

n E

3)

N Cells in thousands

GTX980

K40

13

Convective Flux Evaluation (1/3)

14

Convective Flux (2/3):Thread mapping possibilities

• Face-wise is not thread-safe

18 /47

• Cell-based mapping thread safe but with redundancy

14

Convective Flux (2/3): Thread mapping possibilities

• Direction-based mapping thread safe and less redundancy

14


• Multicoloring (MC) Face-based mappingthread safe and No redundancy

14


Compute Residual at

face i

Read striped

N/2 thread

Storestriped

i+1i

MC: Run 1

Compute Residual at

face i

Read coalesced

N thread

Storecoalesced

i

Red: Run 1

Compute Residual at

face i+1

N thread

Compute Residual at

face i+1

N/2 threadMC: Run 2

i+2i+1Read striped

Storestriped 0

50

100

150

200

250

1 2 3 4 5 6 7 8Mem

ory

Ban

dw

idth

[G

B/s

]

Stride length (in 4 bytes)

Memory bandwidth for striped memory access on GTX780:

A[i*stride]=B[i*stride]+C[i*stride]

15/35

Convective Flux (3/3): Multicolred (MC) vs redundant (Red)

MC RED

Face fluxes per call N/2 2N

Total faces fluxes N 2N

Time per call [ms] 0,28 0,71

Total time [ms] 0,56 0,71

Operations ratio - 2x

Total Speedup 1,26x -

1,26x instead of 2x:cost of striped access

16

Convective Flux (3/3): Multicolred (MC) vs redundant (Red)

Convergence Acceleration on GPU (1/3)

• Explicit solver is well adapted to the GPU architecture• Flow convergence is slow (CFL limitation)• Need for convergence acceleration.

• convergence acceleration methods on the GPU?• Multigrid• Implicit residual smoothing

17

Convergence Acceleration on GPU (2/3): Multigrid is also fast on the GPU

• Solve on fine grid• Interpolate solution and residual to coarse grid

CPU

GPU1,81

1,581,35

0

0,5

1

1,5

2

0 100 200 300 400 500

T 2xG

rid

s/T 1

Gri

d

N Cells in thousands

Cost of a 2-Grid schemeconverging to ideal cost of 1,125

• Solve on coase grid assisted by fine residual• Prolongate coarse correction to fine grid

1,125

18

Convergence Acceleration on GPU (3/3): Implicit Residual Smoothing on GPU

• Higher CFL Oscillation in the solution.

• A smoother residual reduces the oscillation ->Higher CFLs.

• Smoothing: diffusion equation solve a tridiagonal system.

i: 0 -> Ni

j: 0 -> N

j

• Higher CFL Oscillation in the solution.

• A smoother residual reduces the oscillation ->Higher CFLs.

• Smoothing: diffusion equation solve a tridiagonal system.

Interesting envelop(CFL increase with IRS x2-x3)

i: 0 -> Ni

j: 0 -> N

j

19

Convergence Acceleration on GPU (3/3): Implicit Residual Smoothing on GPU


CFD GPU solvers• Explicit time integration• Implicit time integration

Classification of CFD operations



Implicit Time Stepping is more Stable but …

20

GMRES + Preconditioner

21/35

ILU is costly on GPU

• ILU-GMRES: Small gainon every iteration but ILU setup is slow:

• MCILU-GMRES: Multi-colored ILU fast only for small problems.

0

1

2

3

4

5

6

7

8

9

10

264k 1058k 2249k 4631k

Spee

du

p

Nrows

Speedup Krylov Speedup ILU

0

1

2

3

4

5

6

7

8

9

264k 1058k 2249k 4631kSp

eed

up

Nrows

Speedup Krylov Speedup MCILU

Why not Jacobi PC

0

20

40

60

80

100

120

140

160

180

264k 1058k 2249k 4631k

Spee

du

p

Nrows

Speedup Krylov Speedup Jacobi

• Jacobi-GMRES: very fastbut stable only for small time steps

• Jacobi-GMRES: Speedup decreases for higher CFLs

23

56023412 3586

1178

16568

579

16691

591

CPU ILU GPU ILU CPU OD-ILU GPU OD-ILU

Solve [s] Assemble[s]

On-demand factorization

5.55x11.46x

24

Speedup 1000x

Speedup 10x

Actually GPU is slower

Picture modified from :https://pixabay.com/en/blind-men-elephant-story-feel-see-1458438/

Classification (1/2):GPUs controversy

• GPU thousands of lightweight cores.

• Explicit solver: 10x to 100x speedup.

• Implicit solver: 1x to 10x speedup

We need a classification

25

Speedup 1000x

Speedup 10x


Picture modified from :https://pixabay.com/en/blind-men-elephant-story-feel-see-1458438/

25

Classification (1/2):GPUs controversy

38

Full article: Aissa M. , Verstraete ,T., and Vuik, c.. "Toward a GPU-aware comparison of explicit and implicit CFD simulations on structured meshes." Computers & Mathematics with Applications 74.1 (2017): 201-217.

Speedup 1000x

Speedup 10x


25

Classification (2/2):CFD operations

http://ta.twi.tudelft.nl/nw/users/vuik/papers/Ais17VV.pdf

Performance Comparison: Explicit/Implicit

136x

7x

3x

Ref

GPU Explicit

GPU Implicit

CPU Implicit

CPU Explicit

26

Performance Comparison: Explicit/Implicit

0,001

0,01

0,1

1

10

100

1000

0,1 1 10 100 1000

Implicit CPU Explicit CPU implicit GPU explicit GPU

136x

Normalized wall time

27

Rc=14 (low CFL for implicit =15)

Example of a stator Optimization

0,001

0,01

0,1

1

10

100

1000

0,1 1 10 100 1000

Implicit CPU Explicit CPU

implicit GPU explicit GPU

28/35


29

LS82 cascade

Rc=457 (Explicit solver bad flow convergence)

0,001

0,01

0,1

1

10

100

1000

0,1 1 10 100 1000

Implicit CPU Explicit CPU

implicit GPU explicit GPU

31

LS82 cascade: Results

32

LS82 cascade: Optimized blade

Summary

Choice Explicit/Implicit: Convergence ratio is decisive.

Explicit RANS:100x-180x speedup.

Implicit RANS: 10x-20x speedup(due to slow preconditioning.

On-demand preconditioning:x3 faster but GPU-friendlierpreconditioner is needed.

The classification: an operation-specific acceleration offers more insights.

34

Can your simulation profit from the GPU?

• Where you situate your algorithm (slide 4: QR to ray-tracing)?• Do you need double precision (for half-precision FPGA is faster)?• ready to code (otherwise openACC is easier to use)?• Anyone provided a classification for operation used in your field?

35

Thanks for your attention

Dr. Mohamed H. Aissa

Turbomachinery & Propulsion Department

Email: [email protected]

www.researchgate.net/profile/Mohamed_Aissa3

mailto:[email protected]

http://www.researchgate.net/profile/Mohamed_Aissa3

gpu-accelerated cfd simulations for turbomachinery …...aissa m. , verstraete ,t., and vuik, c.....

Documents