
Page 1: Culises: A Library for Accelerated CFD on Hybrid GPU …developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S... · GAMG solver vs AMGPCG solver • Utilization not optimal

FluiDyna GmbH Lichtenbergstraße 8 D-85748 Garching b. München www.fluidyna.com

Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems

Dr. Bjoern Landmann

Page 2

Content

• Brief overview of the company and motivation for GPU computing
• Library Culises – current status
• Example results
• Current and future development

Page 3

Company Overview: Area of expertise

• Complete package
– HPC-hardware
– CFD-consulting
– HPC-software

[Images: Workstations, GPUs, Cluster, Rackmount-server]

Page 4

Company Overview: CFD-Consulting - Examples

• Automotive: Car-truck passing maneuver
– Steady simulation (one snapshot only)
– Small cluster → weeks/months of simulation time
– Medium cluster (512 CPU cores) → a week

• Pharmaceutics: Stirred tank bioreactors
– Unsteady simulation (multiphase flow)
– Small cluster → several weeks of simulation time
– Medium cluster → a week

Page 5

Company Overview: HPC-Software based on GPU-computing

• Motivation for GPU-accelerated CFD
– Shorter development cycles
– Larger models → increased accuracy
– (Automated) optimization
– … many more …

• LBultra
– Lattice-Boltzmann method: speedup of 20x comparing a single GPU and a CPU (4 cores)

• Culises
– Library for accelerated CFD on hybrid GPU-CPU systems

[Image labels: stand-alone version; plugin for design suite]

Page 6

Library Culises: Interface to application

• Implemented as a dynamic library
• Application interface
– Only the solution of expensive linear system(s) is transferred from CPUs to GPUs
– Assembly of linear system(s) remains on CPUs
– E.g. established coupling with OpenFOAM®; script-based installation that is easy to conduct

• OpenFOAM is a free, open source CFD software package with a large user base across most areas of engineering and science
• OpenFOAM has an extensive range of features to solve complex fluid flows involving chemical reactions, turbulence and heat transfer

Page 7

Library Culises: Schematic overview

• OpenFOAM® (1.7.1/2.0.1/2.1.0): MPI-parallelized CPU implementation based on domain decomposition
– Processor partitioning: CPU 0, CPU 1, CPU 2
– MPI-parallel assembly of system matrices remains on CPUs

• Interface
– linear system Ax=b: cudaMemcpy(…, cudaMemcpyHostToDevice)
– solution x: cudaMemcpy(…, cudaMemcpyDeviceToHost)

• Culises: Solves linear system(s) on multiple GPUs
– GPU 0, GPU 1, GPU 2
– Solvers: PCG, PBiCG, AMGPCG
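The division of labour in the schematic can be sketched in a few lines of Python. This is only an illustration of the data flow, not the Culises interface: `FakeDevice`, `assemble_poisson_1d` and `offloaded_solve` are hypothetical names, the "device" is ordinary host memory, and a small direct solver stands in for the GPU solvers listed on the slide.

```python
import copy


def assemble_poisson_1d(n):
    """CPU side: assemble A x = b for a 1D Poisson problem (tridiagonal, SPD)."""
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        A[i][i] = 2.0
        if i > 0:
            A[i][i - 1] = -1.0
        if i < n - 1:
            A[i][i + 1] = -1.0
    return A, [1.0] * n


class FakeDevice:
    """Hypothetical stand-in for one GPU; it keeps its own copies of the data."""

    def __init__(self):
        self.transfers = []  # log of (direction, name)

    def upload(self, name, host_data):
        # Plays the role of cudaMemcpy(..., cudaMemcpyHostToDevice).
        self.transfers.append(("HostToDevice", name))
        return copy.deepcopy(host_data)

    def download(self, name, device_data):
        # Plays the role of cudaMemcpy(..., cudaMemcpyDeviceToHost).
        self.transfers.append(("DeviceToHost", name))
        return copy.deepcopy(device_data)

    def solve(self, A, b):
        # Gaussian elimination without pivoting (adequate for this SPD matrix).
        n = len(b)
        A = [row[:] for row in A]
        x = b[:]
        for k in range(n):
            for i in range(k + 1, n):
                factor = A[i][k] / A[k][k]
                if factor != 0.0:
                    for j in range(k, n):
                        A[i][j] -= factor * A[k][j]
                    x[i] -= factor * x[k]
        for i in range(n - 1, -1, -1):
            x[i] = (x[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
        return x


def offloaded_solve(A, b, device):
    """Only A and b travel to the device; only x travels back."""
    d_A = device.upload("A", A)
    d_b = device.upload("b", b)
    d_x = device.solve(d_A, d_b)
    return device.download("x", d_x)
```

Everything before `offloaded_solve` is called (here the assembly) stays on the host, which mirrors the statement that assembly of the system matrices remains on the CPUs.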

Page 8

Library Culises: Solvers available

• State-of-the-art solvers for linear systems
– Multi-GPU
– Single or double precision (only DP results are shown)

• Krylov subspace methods
– Conjugate or Bi-Conjugate Gradient method for symmetric and non-symmetric system matrices, respectively
– Preconditioning
  • Jacobi (DiagonalPCG)
  • Incomplete Cholesky (DICPCG)
  • Algebraic Multigrid (AMGPCG)

• Stand-alone Multigrid method under development
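As an illustration of the simplest combination listed above, the following is a minimal sketch of a Jacobi-preconditioned Conjugate Gradient iteration (the textbook algorithm behind the name DiagonalPCG). It is plain Python on dense lists, not the Culises implementation; the function names and the stopping test (relative residual norm) are assumptions made for the example.

```python
def matvec(A, x):
    """Dense matrix-vector product."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def diagonal_pcg(A, b, tol=1e-6, max_iter=1000):
    """Solve A x = b for symmetric positive definite A.

    Returns (x, iterations). Stops when ||r|| / ||b|| <= tol.
    """
    n = len(b)
    x = [0.0] * n
    r = b[:]                                    # r = b - A x with x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = z[:]
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5 or 1.0            # avoid division by zero for b = 0
    for it in range(max_iter):
        if dot(r, r) ** 0.5 / b_norm <= tol:
            return x, it
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

Swapping the line that computes `z` for a different preconditioner application (incomplete Cholesky, a multigrid cycle) gives the other variants named on the slide; the outer iteration stays the same.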

Page 9

Library Culises: Parallel approach

• 1-1 link between MPI-process/rank and GPU
→ CPU partitioning equals GPU partitioning
→ peak performance CPU << peak performance GPU
→ under-utilization of GPUs

• Bunching of MPI-ranks required: n-1 linkage option

• GPUDirect
– Peer-to-peer data exchange (CUDA 4.1 IPC)
– Directly hidden in MPI-implementation (release candidates: OpenMPI, MVAPICH2)

[Diagram: CPU 0, CPU 1, CPU 2 and GPU 0, GPU 1, GPU 2 with 1-1 and 3-1 linkage; MPI_Comm_size(comm, &size)]
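The difference between 1-1 and n-1 linkage comes down to how MPI ranks are mapped to GPUs. A minimal sketch, assuming consecutive ranks are bunched onto the same GPU and that the rank count is a multiple of the GPU count (the slides do not state how Culises performs the mapping; `gpu_for_rank` is a hypothetical helper):

```python
def gpu_for_rank(rank, n_ranks, n_gpus):
    """Map an MPI rank to a GPU index by bunching consecutive ranks."""
    if n_gpus <= 0 or n_ranks % n_gpus != 0:
        raise ValueError("this sketch assumes n_ranks is a multiple of n_gpus")
    if not 0 <= rank < n_ranks:
        raise ValueError("rank out of range")
    ranks_per_gpu = n_ranks // n_gpus
    return rank // ranks_per_gpu


# 1-1 linkage: three ranks, three GPUs
one_to_one = [gpu_for_rank(r, 3, 3) for r in range(3)]
# 3-1 linkage: three ranks share one GPU
three_to_one = [gpu_for_rank(r, 3, 1) for r in range(3)]
```

In a real MPI program `n_ranks` would be the value returned by `MPI_Comm_size`, and the resulting index would be used to select the device before any GPU work is issued.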

Page 10

Example results: Setup

• Amdahl's law and theoretical maximum speedup:

$$s = \frac{1}{1 - f + \frac{f}{a}}$$

$$s_{max} = \lim_{a \to \infty} s(a) = \frac{1}{1 - f}$$

Efficiency:

$$E = \frac{s}{s_{max}}$$

Here $s$ is the speedup, $f$ the fraction of computation that is ported to GPU, and $a$ the acceleration on GPU.

[Plot: speedup s versus fraction f of computation that is ported to GPU, for acceleration on GPU a = 5, a = 10, a = 15 and a → ∞]

Example: On the CPU, the solution of the linear system consumes 80% of total CPU time: $f = 0.8$. With $a = 10$: $s_{max} = 5$, $s = 3.57$, $E = 0.71$.
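The example numbers follow directly from the three formulas. A short sketch that evaluates them (the function names are chosen for this illustration only):

```python
def speedup(f, a):
    """Amdahl's law: total speedup when a fraction f is accelerated by a factor a."""
    return 1.0 / (1.0 - f + f / a)


def max_speedup(f):
    """Limit of speedup(f, a) for a -> infinity."""
    return 1.0 / (1.0 - f)


def efficiency(f, a):
    """Achieved speedup relative to the theoretical maximum."""
    return speedup(f, a) / max_speedup(f)


f, a = 0.8, 10.0
s = round(speedup(f, a), 2)          # 3.57
s_max = round(max_speedup(f), 2)     # 5.0
E = round(efficiency(f, a), 2)       # 0.71
```

The remaining 20% of the run time that stays on the CPU caps the speedup at 5, however fast the GPU solver becomes.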

Page 11

Example results: Setup

• CFD solver: OpenFOAM® 2.0.1/2.1.0

• Fair comparison:
– Best linear solver on CPU vs best linear solver on GPU
  • Krylov: preconditioned Conjugate Gradient method
  • Multigrid method
– Needs considerable tuning of solver parameters for both CPU and GPU solvers (multigrid, SIMPLE¹ algorithm, …)
– Same convergence criterion: specified tolerance of residual

• Hardware configuration: Tyan board with
– 2 CPUs: Intel Xeon X5650 @ 2.67 GHz
– 8 GPUs: Tesla 2070 (6 GB)

¹ Semi-Implicit Method for Pressure-Linked Equations

Page 12

Example results: Automotive: DrivAER

• Generic car shape model
• Incompressible flow
– simpleFoam solver
  • SIMPLE¹ method
– Pressure-velocity coupling
– Poisson equation for pressure: linear system solved by Culises
– k-ω SST turbulence model
– 2 computational grids
  • 3 million grid cells (sequential runs)
  • 22 million grid cells (parallel runs)

[Figure: DrivAER geometry]

Solver control (OpenFOAM®) via config files

CPU solver:

    solvers
    {
        p
        {
            solver          PCG;
            preconditioner  DIC;
            tolerance       1e-6;
            ...
        }
    }

GPU solver (PCG replaced by PCGGPU):

    solvers
    {
        p
        {
            solver          PCGGPU;
            preconditioner  AMG;
            tolerance       1e-6;
            ...
        }
    }

¹ Semi-Implicit Method for Pressure-Linked Equations

Page 13

Example results: DrivAER 3M grid cells

• Single CPU vs single CPU+GPU
– Converged solution (4000 timesteps)
– Validation: comparison of results
  • DICPCG on CPU
  • AMGPCG on GPU

• Memory requirement
– AMGPCG: 40% of 6 GB; 1M cells require 0.80 GB → Tesla 2070: 7.5M cells
– DiagonalPCG: 13% of 6 GB; 1M cells require 0.26 GB → Tesla 2070: 23M cells

[Figure panels: Single CPU (DICPCG); Single CPU+GPU (AMGPCG)]
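The memory figures are a simple linear extrapolation from the 3M-cell case. A sketch of that arithmetic, assuming memory use grows in proportion to the cell count and ignoring any per-GPU overhead (the helper names are hypothetical):

```python
import math


def max_cells_millions(gpu_mem_gb, gb_per_million_cells):
    """Largest grid (in millions of cells) that fits on one GPU."""
    return gpu_mem_gb / gb_per_million_cells


def gpus_needed(n_cells_millions, gpu_mem_gb, gb_per_million_cells):
    """Minimum number of GPUs whose combined memory holds the grid."""
    return math.ceil(n_cells_millions * gb_per_million_cells / gpu_mem_gb)


amg_limit = max_cells_millions(6.0, 0.80)    # about 7.5 million cells
diag_limit = max_cells_millions(6.0, 0.26)   # about 23 million cells
amg_gpus_22m = gpus_needed(22.0, 6.0, 0.80)  # 3
```

Under the same assumption the 22M-cell grid needs at least 3 GPUs with AMGPCG, which is the minimum quoted for the parallel runs.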

Page 14

Example results: DrivAER 3M grid cells

• Speedup with single GPU

Speedup: $s = \frac{1}{1 - f + \frac{f}{a}}$; theoretical maximum speedup: $s_{max} = \frac{1}{1 - f}$; GPU acceleration (speedup of the linear solver): $a$; efficiency: $E = \frac{s}{s_{max}}$

Solver CPU     Solver GPU     Fraction f   Speedup s   Max. speedup s_max   Acceleration a   Efficiency E
GAMG¹          AMGPCG         0.55         1.56        2.22                 3.36             68%
DICPCG         DiagonalPCG    0.78         2.7         4.55                 5.8              60%
DiagonalPCG    DiagonalPCG    0.87         4.9         7.7                  11.6             64%

¹ GAMG: Generalized geometric-algebraic multigrid solver; geometric agglomeration based on grid face areas

Page 15

Example results: DrivAER 3M grid cells

• Performance with multiple GPUs
• Strong scaling: multiple CPUs+GPUs (1-1 linkage)
– Scaling of total code versus # of CPUs and # of GPUs
– Scaling of linear solver versus # of CPUs and # of GPUs

[Plot, AMGPCG solver: simulation time and scaling versus # of CPUs = # of GPUs; curves: total time, time linear solver, scaling total, scaling linear solver]

Page 16

Example results: DrivAER 3M grid cells

• Speedup by adding multiple GPUs (1-1 linkage)

Column "n+n" means n CPUs + n GPUs compared with n CPUs only.

                               Speedup total s                 Speedup linear solver a
Solver CPU vs Solver GPU       1+1    2+2    4+4    6+6        1+1    2+2    4+4    6+6
GAMG vs AMGPCG                 1.56   1.64   1.29   1.27       3.36   3.06   2.38   2.13
DICPCG vs DiagonalPCG          2.7    1.49   1.20   1.45       5.8    1.95   1.46   1.84
DiagonalPCG vs DiagonalPCG     4.9    2.84   1.79   2.03       11.6   4.14   2.39   2.80

Example: the computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than when running on 2 CPUs only.

Page 17

Example results: DrivAER 22M grid cells

• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈90%)

[Plot: simulation time and scaling versus # of CPUs = # of GPUs (3, 4, 6, 8); GAMG on CPUs only (dashed), AMGPCG on CPUs+GPUs (solid); curves: total time, time linear solver, scaling total, scaling linear solver, each also for CPU only]

Page 18

Example results: DrivAER 22M grid cells

• Speedup by adding multiple GPUs (1-1 linkage): GAMG solver vs AMGPCG solver
• Utilization not optimal
– Further optimization under development: n-1 linkage between CPU-GPU

# of CPUs + # of GPUs added          3 + 3    4 + 4    6 + 6    8 + 8
Speedup s                            1.56     1.58     1.54     1.42
Speedup linear solver a              3.4      2.81     2.91     2.33
Fraction f                           0.60     0.59     0.57     0.50
Theoretical max speedup s_max        2.50     2.43     2.33     2.00
Efficiency E                         62%      65%      66%      71%
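The last two rows of this table can be recomputed from the tabulated speedup s and fraction f using the definitions given earlier, $s_{max} = 1/(1 - f)$ and $E = s/s_{max}$. A small sketch (the data are copied from the table; `derived` is a name chosen for this illustration, and agreement is only expected up to the rounding of f):

```python
# (configuration, speedup s, fraction f) as tabulated
rows = [
    ("3 CPUs + 3 GPUs", 1.56, 0.60),
    ("4 CPUs + 4 GPUs", 1.58, 0.59),
    ("6 CPUs + 6 GPUs", 1.54, 0.57),
    ("8 CPUs + 8 GPUs", 1.42, 0.50),
]


def derived(s, f):
    """Return (s_max, E) from measured speedup s and fraction f."""
    s_max = 1.0 / (1.0 - f)
    return s_max, s / s_max


for name, s, f in rows:
    s_max, E = derived(s, f)
    print(f"{name}: s_max = {s_max:.2f}, E = {100 * E:.0f}%")
```

The efficiency rises with the number of CPUs mainly because the fraction f, and with it the attainable maximum, shrinks; the achieved speedup s itself stays between 1.4 and 1.6.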

Page 19

Example results: Multiphase flow: ship hull

• LTSinterFoam solver
– Steady with use of local time stepping method
– Volume of fluid (VoF) method
– Pressure solver: linear system → Culises
• 4M grid cells

Solver CPU     Solver GPU     Fraction f   Speedup s   Max. speedup s_max   Acceleration a   Efficiency E
DICPCG         DiagonalPCG    0.43         1.54        1.75                 4.91             88%
DiagonalPCG    DiagonalPCG    0.55         2.12        2.22                 8.66             95%

Page 20

Example results: Heat transfer: heated room

• buoyantPimpleFoam solver
– Unsteady PISO¹ method
– Pressure solver: linear system → Culises
• 4M grid cells

Solver CPU     Solver GPU     Fraction f   Speedup s   Max. speedup s_max   Acceleration a   Efficiency E
DICPCG         DiagonalPCG    0.72         2.45        3.57                 6.11             69%
DiagonalPCG    DiagonalPCG    0.80         3.59        5.00                 9.90             72%

¹ Pressure-Implicit with Splitting of Operators

Page 21

Example results: Process industry: flow molding

• pisoFoam solver
– Unsteady
– Pressure solver: linear system → Culises
– 500k grid cells

Solver CPU     Solver GPU     Fraction f   Speedup s   Max. speedup s_max   Acceleration a   Efficiency E
DICPCG         DiagonalPCG    0.84         2.65        6.25                 3.6              42%
DiagonalPCG    DiagonalPCG    0.94         6.9         16.7                 10.4             42%

Page 22

Example results: Pharmaceutical: generic bioreactor

• interFoam solver
– Unsteady
– VoF method
– Pressure solver: linear system → Culises
• 500k grid cells

[Figure labels: liquid surface; shaking device (off-centered spindle)]

Solver CPU     Solver GPU     Fraction f   Speedup s   Max. speedup s_max   Acceleration a   Efficiency E
GAMG           AMGPCG         0.53         1.44        2.12                 2.59             68%
DiagonalPCG    DiagonalPCG    0.81         3.00        5.26                 5.94             57%

Page 23

Summary

• Speedup categorized by application, obtained from (averaged) single GPU test cases

[Bar chart: speedup, acceleration and efficiency per application; OpenFOAM® basic = 1 for every application]

Application        Speedup   Acceleration   Efficiency
automotive         1.6       3.4            65%
multiphase         1.9       6.8            91%
heat transfer      3         8              70%
pharmaceutics      2.22      4.27           63%
process industry   4.7       7              42%

Page 24

Future Culises features: Under development

• Stand-alone multigrid solver
• Multi-GPU usage and scalability
– Optimized load balancing via n-1 linkage between CPU-GPU
– Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers

Page 25

Questions?