
FluiDyna GmbH Lichtenbergstraße 8 D-85748 Garching b. München www.fluidyna.com

Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems

Dr. Bjoern Landmann

Content

• Brief overview of the company and motivation for GPU computing

• Library Culises – current status

• Example results

• Current and future development

Company Overview

Area of expertise

• Complete package

– HPC-hardware: workstations, clusters, rackmount servers

– CFD-consulting

– HPC-software

CFD-Consulting - Examples

• Automotive: car-truck passing maneuver. Steady simulation (one snapshot only): small cluster → weeks/months of simulation time; medium cluster (512 CPU cores) → a week

• Pharmaceutics: stirred tank bioreactors. Unsteady simulation (multiphase flow): small cluster → several weeks of simulation time; medium cluster → a week

HPC-Software based on GPU-computing

• Motivation for GPU-accelerated CFD

– Shorter development cycles

– Larger models → increased accuracy

– (Automated) optimization

– … many more …

• LBultra – Lattice-Boltzmann method: speedup of 20x for a single GPU compared with a CPU (4 cores); available as a stand-alone version and as a plugin for a design suite

• Culises – Library for accelerated CFD on hybrid GPU-CPU systems

Library Culises

Interface to application

• Implemented as a dynamic library

• Application interface

– Only the solution of the expensive linear system(s) is transferred from CPUs to GPUs

– Assembly of the linear system(s) remains on CPUs

– E.g. established coupling with OpenFOAM®; the script-based installation is easy to conduct

Schematic overview

• OpenFOAM is a free, open source CFD software package with a large user base across most areas of engineering and science

• OpenFOAM has an extensive range of features for solving complex fluid flows involving chemical reactions, turbulence and heat transfer

• OpenFOAM® (1.7.1/2.0.1/2.1.0): MPI-parallelized CPU implementation based on domain decomposition (processor partitioning); the MPI-parallel assembly of the system matrices remains on the CPUs

• Interface: the linear system Ax=b is copied to the GPU with cudaMemcpy(…, cudaMemcpyHostToDevice); the solution x is copied back with cudaMemcpy(…, cudaMemcpyDeviceToHost)

• Culises: solves the linear system(s) on multiple GPUs with PCG, PBiCG or AMGPCG

Solvers available

• State-of-the-art solvers for linear systems

– Multi-GPU

– Single or double precision (only DP results are shown)

• Krylov subspace methods

– Conjugate or Bi-Conjugate Gradient method for symmetric and non-symmetric system matrices

– Preconditioning: Jacobi (DiagonalPCG), Incomplete Cholesky (DICPCG), Algebraic Multigrid (AMGPCG)

• Stand-alone Multigrid method under development
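To make the Jacobi-preconditioned Conjugate Gradient variant (DiagonalPCG) listed above concrete, here is a minimal CPU-only sketch in plain Python. It is not the Culises implementation: the function name `jacobi_pcg`, the dense list-of-rows matrix format and the small Poisson test system are assumptions chosen for illustration only.

```python
def jacobi_pcg(A, b, tol=1e-6, max_iter=1000):
    """Solve A x = b with Jacobi-preconditioned CG.

    A is a dense symmetric positive definite matrix given as a list of rows.
    Returns the solution and the number of iterations used.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi: M^-1 = diag(A)^-1
    x = [0.0] * n
    r = list(b)                                   # r = b - A x for x = 0
    b_norm = dot(b, b) ** 0.5 or 1.0
    if dot(r, r) ** 0.5 <= tol * b_norm:
        return x, 0

    z = [inv_diag[i] * r[i] for i in range(n)]    # preconditioned residual
    p = list(z)
    rz = dot(r, z)
    for iteration in range(1, max_iter + 1):
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= tol * b_norm:      # relative residual tolerance
            return x, iteration
        z = [inv_diag[i] * r[i] for i in range(n)]
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


# Small 1D Poisson test matrix (tridiagonal 2, -1); the exact solution is all ones.
n = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = [1.0, 0.0, 0.0, 0.0, 1.0]
x, iterations = jacobi_pcg(A, b)
```

Swapping the `inv_diag` step for an incomplete Cholesky or multigrid cycle would give the other preconditioned variants; the CG loop itself stays the same.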

Parallel approach

• 1-1 link between MPI process/rank and GPU

→ CPU partitioning equals GPU partitioning

→ peak performance CPU << peak performance GPU

→ under-utilization of GPUs

• Bunching of MPI ranks required: n-1 linkage option

• GPUDirect

– Peer-to-peer data exchange (CUDA 4.1 IPC)

– Directly hidden in the MPI implementation (release candidates: OpenMPI, MVAPICH2)

[Figure: 1-1 versus 3-1 linkage of MPI ranks to GPUs; number of ranks obtained from MPI_Comm_size(comm, &size)]
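The difference between 1-1 and n-1 linkage can be sketched as a mapping from MPI rank to GPU index. The helper `gpu_for_rank` below is hypothetical and not part of Culises or MPI; it assumes that consecutive ranks are bunched onto the same GPU and that the rank count would come from `MPI_Comm_size` in a real run.

```python
def gpu_for_rank(rank, num_ranks, num_gpus):
    """Return the GPU index serving a given MPI rank.

    Consecutive ranks are bunched so that each GPU serves at most
    ceil(num_ranks / num_gpus) ranks (n-1 linkage); with
    num_ranks == num_gpus this reduces to 1-1 linkage.
    """
    if num_ranks <= 0 or num_gpus <= 0:
        raise ValueError("num_ranks and num_gpus must be positive")
    if not 0 <= rank < num_ranks:
        raise ValueError("rank out of range")
    ranks_per_gpu = -(-num_ranks // num_gpus)  # ceiling division
    return rank // ranks_per_gpu


# 1-1 linkage: 4 ranks on 4 GPUs
one_to_one = [gpu_for_rank(r, 4, 4) for r in range(4)]
# 3-1 linkage: 6 ranks bunched onto 2 GPUs
three_to_one = [gpu_for_rank(r, 6, 2) for r in range(6)]
```

With bunching, the CPU-side domain decomposition can stay fine-grained while each GPU receives a larger share of the linear system.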

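Example results

• Amdahl's law and theoretical maximum speedup:

$$ s = \frac{1}{1 - f + \frac{f}{a}} $$

$$ s_{max} = \lim_{a \to \infty} s(a) = \frac{1}{1 - f} $$

$$ E = \frac{s}{s_{max}} $$

Here s is the speedup, f is the fraction of the computation that is ported to the GPU, a is the acceleration on the GPU, and E is the efficiency.

[Figure: speedup s versus fraction f for a = 5, a = 10, a = 15 and a → ∞]

Example: if the solution of the linear system consumes 80% of the total CPU time, then f = 0.8; with a = 10 this gives $s_{max} = 5$, $s = 3.57$ and $E = 0.71$.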
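The worked example can be reproduced with a few lines of Python. This is a plain transcription of the three formulas; the function names are chosen here for illustration and do not come from the slides.

```python
def speedup(f, a):
    """Amdahl speedup s for ported fraction f and GPU acceleration a."""
    if not 0.0 <= f < 1.0 or a <= 0.0:
        raise ValueError("need 0 <= f < 1 and a > 0")
    return 1.0 / (1.0 - f + f / a)


def max_speedup(f):
    """Theoretical maximum speedup s_max, reached for a -> infinity."""
    if not 0.0 <= f < 1.0:
        raise ValueError("need 0 <= f < 1")
    return 1.0 / (1.0 - f)


def efficiency(f, a):
    """Efficiency E = s / s_max."""
    return speedup(f, a) / max_speedup(f)


# Values from the example: f = 0.8, a = 10
s = speedup(0.8, 10.0)
s_max = max_speedup(0.8)
E = efficiency(0.8, 10.0)
print(f"s = {s:.2f}, s_max = {s_max:.2f}, E = {E:.2f}")
```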

• CFD solver: OpenFOAM® 2.0.1/2.1.0

• Fair comparison:

– Best linear solver on CPU vs best linear solver on GPU (Krylov: preconditioned Conjugate Gradient method; Multigrid method)

– Needs considerable tuning of solver parameters for both CPU and GPU solvers (multigrid, SIMPLE algorithm, …); SIMPLE: Semi-Implicit Method for Pressure-Linked Equations

– Same convergence criterion: specified tolerance of residual

• Hardware configuration: Tyan board with

– 2 CPUs: Intel Xeon X5650 @ 2.67 GHz

– 8 GPUs: Tesla 2070 (6GB)

Automotive: DrivAER

• Generic car shape model (DrivAER geometry)

• Incompressible flow

– simpleFoam solver

– SIMPLE (Semi-Implicit Method for Pressure-Linked Equations) method: pressure-velocity coupling; Poisson equation for pressure → linear system solved by Culises

– k-ω SST turbulence model

• 2 computational grids

– 3 million grid cells (sequential runs)

– 22 million grid cells (parallel runs)

• Solver control (OpenFOAM®) via config files

CPU solver:

    solvers
    {
        p
        {
            solver          PCG;
            preconditioner  DIC;
            tolerance       1e-6;
            ...
        }
    }

GPU solver:

    solvers
    {
        p
        {
            solver          PCGGPU;
            preconditioner  AMG;
            tolerance       1e-6;
            ...
        }
    }

DrivAER 3M grid cells

• Single CPU vs single CPU+GPU

– Converged solution (4000 timesteps)

– Validation: comparison of results, DICPCG on single CPU vs AMGPCG on single CPU+GPU

• Memory requirement

– AMGPCG: 40% of 6 GB; 1M cells require 0.80 GB → Tesla 2070: 7.5M cells

– DiagonalPCG: 13% of 6 GB; 1M cells require 0.26 GB → Tesla 2070: 23M cells
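The memory figures follow from simple proportions on the 3M-cell grid and the 6 GB card. The short sketch below redoes that arithmetic; the function names are illustrative, and linear scaling of memory with cell count is the assumption the slide itself implies.

```python
def gb_per_million_cells(used_fraction, gpu_memory_gb, grid_million_cells):
    """GPU memory needed per one million cells, assuming linear scaling."""
    return used_fraction * gpu_memory_gb / grid_million_cells


def max_million_cells(gpu_memory_gb, gb_per_million):
    """Largest grid (in millions of cells) that fits into GPU memory."""
    return gpu_memory_gb / gb_per_million


GPU_MEMORY_GB = 6.0   # Tesla 2070
GRID_MCELLS = 3.0     # DrivAER 3M grid

amg_gb = gb_per_million_cells(0.40, GPU_MEMORY_GB, GRID_MCELLS)    # AMGPCG
diag_gb = gb_per_million_cells(0.13, GPU_MEMORY_GB, GRID_MCELLS)   # DiagonalPCG

amg_capacity = max_million_cells(GPU_MEMORY_GB, amg_gb)
diag_capacity = max_million_cells(GPU_MEMORY_GB, diag_gb)
print(f"AMGPCG: {amg_gb:.2f} GB per 1M cells, up to {amg_capacity:.1f}M cells")
print(f"DiagonalPCG: {diag_gb:.2f} GB per 1M cells, up to {diag_capacity:.1f}M cells")
```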

• Speedup with single GPU

Solver CPU     Solver GPU     f      s      s_max   a
GAMG (1)       AMGPCG         0.55   1.56   2.22    3.36
DICPCG         Diagonal PCG   0.78   2.7    4.55    5.8
Diagonal PCG   Diagonal PCG   0.87   4.9    7.7     11.6

f: fraction; s: speedup, $s = \frac{1}{1 - f + \frac{f}{a}}$; s_max: theoretical maximum speedup, $s_{max} = \frac{1}{1 - f}$; a: GPU-acceleration speedup of the linear solver; efficiency $E = \frac{s}{s_{max}}$

(1) GAMG: generalized geometric-algebraic multigrid solver; geometric agglomeration based on grid face areas
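As a cross-check, the tabulated f and a can be fed into Amdahl's law and compared with the tabulated speedup s. The sketch below does this for the three solver pairings; the row values are copied from the table, while the variable names and the comparison itself are additions for illustration. Since the tabulated numbers are rounded, small deviations between predicted and tabulated speedup are expected.

```python
# (CPU solver, GPU solver, fraction f, tabulated speedup s, acceleration a)
rows = [
    ("GAMG", "AMGPCG", 0.55, 1.56, 3.36),
    ("DICPCG", "Diagonal PCG", 0.78, 2.7, 5.8),
    ("Diagonal PCG", "Diagonal PCG", 0.87, 4.9, 11.6),
]

predicted = []
for cpu, gpu, f, s_table, a in rows:
    s_pred = 1.0 / (1.0 - f + f / a)      # Amdahl's law
    deviation = (s_pred - s_table) / s_table
    predicted.append(s_pred)
    print(f"{cpu} vs {gpu}: predicted s = {s_pred:.2f}, "
          f"tabulated s = {s_table}, deviation = {deviation:+.1%}")
```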

• Performance with multiple GPUs

• Strong scaling: multiple CPUs+GPUs (1-1 linkage), AMGPCG solver

– Scaling of total code versus # of CPUs and # of GPUs

– Scaling of linear solver versus # of CPUs and # of GPUs

[Figure: total time, linear solver time, total scaling and linear solver scaling versus # of CPUs = # of GPUs]

DrivAER 3M grid

• Speedup by adding multiple GPUs (1-1 linkage)

Speedup total s

Solver CPU vs Solver GPU       1 CPU + 1 GPU   2 CPUs + 2 GPUs   4 CPUs + 4 GPUs   6 CPUs + 6 GPUs
GAMG vs AMGPCG                 1.56            1.64              1.29              1.27
DICPCG vs Diagonal PCG         2.7             1.49              1.20              1.45
Diagonal PCG vs Diagonal PCG   4.9             2.84              1.79              2.03

Speedup linear solver a

Solver CPU vs Solver GPU       1 CPU + 1 GPU   2 CPUs + 2 GPUs   4 CPUs + 4 GPUs   6 CPUs + 6 GPUs
GAMG vs AMGPCG                 3.36            3.06              2.38              2.13
DICPCG vs Diagonal PCG         5.8             1.95              1.46              1.84
Diagonal PCG vs Diagonal PCG   11.6            4.14              2.39              2.80

Example (Diagonal PCG vs Diagonal PCG): the computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than when running on 2 CPUs only

• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈90%)

[Figure: total time, linear solver time, total scaling and linear solver scaling versus # of CPUs = # of GPUs (3, 4, 6, 8); GAMG on CPUs only (dashed), AMGPCG on CPUs+GPUs (solid)]

• Speedup by adding multiple GPUs (1-1 linkage), GAMG solver vs AMGPCG solver

# of CPUs + # of GPUs added       3 + 3   4 + 4   6 + 6   8 + 8
Speedup s                         1.56    1.58    1.54    1.42
Speedup linear solver a           3.4     2.81    2.91    2.33
Fraction f                        0.60    0.59    0.57    0.50
Theoretical max. speedup s_max    2.50    2.43    2.33    2.00
Efficiency E                      62%     65%     66%     71%

• Utilization not optimal; further optimization under development: n-1 linkage between CPU and GPU
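The last two rows of the multi-GPU table are derived quantities: s_max follows from f, and E from s and s_max. A short sketch, with the numbers copied from the table and variable names chosen here for illustration, recomputes both rows and shows that they agree with the tabulated values to within rounding.

```python
configs = ["3+3", "4+4", "6+6", "8+8"]          # CPUs + GPUs
s_table = [1.56, 1.58, 1.54, 1.42]              # speedup s
f_table = [0.60, 0.59, 0.57, 0.50]              # fraction f
s_max_table = [2.50, 2.43, 2.33, 2.00]          # theoretical max. speedup
e_table = [0.62, 0.65, 0.66, 0.71]              # efficiency E

s_max_calc = [1.0 / (1.0 - f) for f in f_table]           # s_max = 1 / (1 - f)
e_calc = [s / s_max for s, s_max in zip(s_table, s_max_calc)]  # E = s / s_max

for name, sm, e in zip(configs, s_max_calc, e_calc):
    print(f"{name}: s_max = {sm:.2f}, E = {e:.0%}")
```

Note that E rises with the GPU count here mainly because f, and with it s_max, falls; the speedup s itself does not improve.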

Multiphase flow: ship hull

• LTSinterFoam solver

– Steady, with use of local time stepping method

– Volume of fluid (VoF) method

– Pressure solver: linear system → Culises

• 4M grid cells

Solver CPU     Solver GPU     Fraction f   Speedup s   Theoret. max. speedup   GPU-acceleration linear solver a
DICPCG         Diagonal PCG   0.43         1.54        1.75                    4.91
Diagonal PCG   Diagonal PCG   0.55         2.12        2.22                    8.66

Heat transfer: heated room

• buoyantPimpleFoam solver

– Unsteady PISO (Pressure-Implicit with Splitting of Operators) method

– Pressure solver

• 4M grid cells

Solver CPU     Solver GPU     Fraction f   Speedup s   Theoret. max. speedup   GPU-acceleration linear solver a
DICPCG         Diagonal PCG   0.72         2.45        3.57                    6.11
Diagonal PCG   Diagonal PCG   0.80         3.59        5.00                    9.90

Process industry: flow molding

• pisoFoam solver

– Unsteady

– Pressure solver: linear system → Culises

– 500K grid cells

Solver CPU     Solver GPU     Fraction f   Speedup s   Theoret. max. speedup   GPU-acceleration linear solver a
DICPCG         Diagonal PCG   0.84         2.65        6.25                    3.6
Diagonal PCG   Diagonal PCG   0.94         6.9         16.7                    10.4

Pharmaceutical: generic bioreactor

• interFoam solver

– Unsteady

– VoF method

– Pressure solver

• 500k grid cells

[Figure: generic bioreactor with liquid surface and shaking device (off-centered spindle)]

Solver CPU     Solver GPU     Fraction f   Speedup s   Theoret. max. speedup   GPU-acceleration linear solver a
GAMG           AMGPCG         0.53         1.44        2.12                    2.59
Diagonal PCG   Diagonal PCG   0.81         3.00        5.26                    5.94

Summary

• Speedup categorized by application, obtained from (averaged) single GPU test cases

[Figure: bar chart by application (automotive, multiphase, heat transfer, pharmaceutics, process industry); legend: Speedup, Acceleration, OpenFOAM® basic, Efficiency]

Future Culises features

Under development

• Stand-alone multigrid solver

• Multi-GPU usage and scalability

– Optimized load balancing via n-1 linkage between CPU and GPU

– Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers

Questions?
