Post on 02-Apr-2018
FluiDyna GmbH Lichtenbergstraße 8 D-85748 Garching b. München www.fluidyna.com
Culises: A Library for Accelerated CFD on Hybrid GPU-CPU Systems
Dr. Bjoern Landmann
Culises - A Library for Accelerated CFD on Hybrid GPU-CPU Systems Slide 2 B. Landmann
• Brief overview on the company and motivation for GPU-computing
• Library Culises – current status
• Example results
• Current and future development
Content
• Complete package
– HPC-hardware
– CFD-consulting
– HPC-software
Area of expertise
Company Overview
Slide 3 B. Landmann
Workstations
GPUs
Cluster
Rackmount-server
CFD-Consulting - Examples
Company Overview
Slide 4 B. Landmann
Automotive: Car-truck passing maneuver
– Steady simulation (one snapshot only)
– Small cluster → weeks/months of simulation time
– Medium cluster (512 CPU cores) → a week
Pharmaceutics: Stirred tank bioreactors
– Unsteady simulation (multiphase flow)
– Small cluster → several weeks of simulation time
– Medium cluster → a week
• Motivation for GPU-accelerated CFD
– Shorter development cycles
– Larger models → increased accuracy
– (Automated) optimization
– … many more …
• LBultra – Lattice-Boltzmann method: 20x speedup of a single GPU over a 4-core CPU
• Culises – Library for accelerated CFD on hybrid GPU-CPU systems
HPC-Software based on GPU-computing
Company Overview
Slide 5 B. Landmann
stand-alone version plugin for design suite
• Implemented as a dynamic library
• Application interface
– Only the solution of the expensive linear system(s) is transferred from CPUs to GPUs
– Assembly of linear system(s) remains on CPUs
– E.g. established coupling with OpenFOAM®: easy-to-conduct, script-based installation
Interface to application
Library Culises
Slide 6 B. Landmann
• OpenFOAM is a free, open source CFD software package with a large user base across most areas of engineering and science
• OpenFOAM has an extensive range of features to solve complex fluid flows involving chemical reactions, turbulence and heat transfer
Schematic overview
Library Culises
Slide 7 B. Landmann
OpenFOAM® (1.7.1/2.0.1/2.1.0): MPI-parallelized CPU implementation based on domain decomposition
– Processor partitioning: CPU 0, CPU 1, CPU 2
– MPI-parallel assembly of system matrices remains on CPUs
Interface:
– cudaMemcpy(…, cudaMemcpyHostToDevice): linear system Ax=b, CPU → GPU
– cudaMemcpy(…, cudaMemcpyDeviceToHost): solution x, GPU → CPU
Culises: solves linear system(s) on multiple GPUs
– GPU 0, GPU 1, GPU 2
– Solvers: PCG, PBiCG, AMGPCG
• State-of-the-art solvers for linear systems
– Multi-GPU
– Single or double precision (only DP results are shown)
• Krylov subspace methods
– Conjugate or Bi-Conjugate Gradient method for symmetric and non-symmetric system matrices
– Preconditioning
  • Jacobi (DiagonalPCG)
  • Incomplete Cholesky (DICPCG)
  • Algebraic Multigrid (AMGPCG)
• Stand-alone Multigrid method under development
Solvers available
Library Culises
Slide 8 B. Landmann
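To make the preconditioned Krylov idea concrete, the following is a minimal CPU-only sketch of a conjugate gradient iteration with a Jacobi (diagonal) preconditioner, i.e. the method the slide lists as DiagonalPCG. It uses plain Python on dense lists for readability. It is not Culises code and says nothing about how the library implements the solver on GPUs; the function name `jacobi_pcg` and its parameters are hypothetical.

```python
def jacobi_pcg(A, b, tol=1e-6, max_iter=1000):
    """Solve A x = b for a symmetric positive definite A (dense list of lists).

    Illustrative sketch only: Jacobi-preconditioned conjugate gradient.
    Returns the solution vector and the number of iterations performed.
    """
    n = len(b)

    def matvec(M, v):
        return [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1
    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    if rz == 0.0:                                 # b is zero: x = 0 is the solution
        return x, 0
    b_norm = dot(b, b) ** 0.5

    for iteration in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        # Same kind of stopping rule as on the slides: tolerance on the residual
        if dot(r, r) ** 0.5 / b_norm < tol:
            return x, iteration
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Small symmetric positive definite test system (made-up values)
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x, iterations = jacobi_pcg(A, b)
print(x, iterations)
```

In a production solver the matrix would be sparse and the matrix-vector product and dot products would run on the GPU; the control flow of the iteration stays the same.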
• 1-1 link between MPI-process/rank and GPU
→ CPU partitioning equals GPU partitioning
→ peak performance CPU << peak performance GPU
→ under-utilization of GPUs
• Bunching of MPI-ranks required: n-1 linkage option
• GPUDirect
– Peer-to-peer data exchange (CUDA 4.1 IPC)
– Directly hidden in MPI-implementation (release candidates: OpenMPI, MVAPICH2)
Parallel approach
Library Culises
Slide 9 B. Landmann
[Figure: CPU 0, CPU 1, CPU 2 linked to GPU 0, GPU 1, GPU 2, shown for 1-1 linkage and 3-1 linkage; MPI_Comm_size(comm, &size)]
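The difference between 1-1 and n-1 linkage can be sketched as a mapping from MPI ranks to GPU devices. The helper below is purely illustrative: `gpu_for_rank`, `linkage` and `ranks_per_gpu` are hypothetical names, and the block-wise assignment is an assumption, not the scheme Culises actually uses.

```python
def gpu_for_rank(rank, ranks_per_gpu=1):
    """Return the index of the GPU that serves a given MPI rank.

    Assumption for illustration: consecutive ranks are bunched block-wise.
    """
    if ranks_per_gpu < 1:
        raise ValueError("ranks_per_gpu must be at least 1")
    return rank // ranks_per_gpu


def linkage(num_ranks, ranks_per_gpu=1):
    """Group MPI ranks by the GPU they are bunched onto."""
    groups = {}
    for rank in range(num_ranks):
        groups.setdefault(gpu_for_rank(rank, ranks_per_gpu), []).append(rank)
    return groups


# 1-1 linkage: every rank drives its own GPU
one_to_one = linkage(num_ranks=3, ranks_per_gpu=1)
# 3-1 linkage: three ranks share a single GPU
three_to_one = linkage(num_ranks=3, ranks_per_gpu=3)
print(one_to_one)
print(three_to_one)
```

With n-1 linkage each GPU receives the combined partitions of several ranks, which is the bunching the slide describes as a remedy for under-utilized GPUs.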
• Amdahl's law and theoretical maximum speedup:
Setup
Example results
Slide 10 B. Landmann
[Plot: speedup s versus the fraction f of computation that is ported to GPU, for acceleration on GPU a = 5, a = 10, a = 15 and a → ∞]
$$ s = \frac{1}{1 - f + \frac{f}{a}} $$
$$ s_{max} = \lim_{a \to \infty} s(a) = \frac{1}{1 - f} $$
Efficiency: $$ E = \frac{s}{s_{max}} $$
Example: on the CPU, the solution of the linear system consumes 80% of total CPU time, so f = 0.8. With a = 10: s_max = 5, s = 3.57, E = 0.71
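The example on this slide can be reproduced with a few lines of Python. This is a minimal sketch of Amdahl's law as stated above; the function names are chosen here for illustration and do not come from the presentation.

```python
def speedup(f, a):
    """Overall speedup s when a fraction f of the run time is accelerated by a factor a."""
    return 1.0 / (1.0 - f + f / a)


def max_speedup(f):
    """Theoretical maximum speedup: the limit of s for a -> infinity."""
    return 1.0 / (1.0 - f)


def efficiency(f, a):
    """Efficiency E = s / s_max."""
    return speedup(f, a) / max_speedup(f)


# Values from the slide example: the linear solver takes 80% of the CPU time
# and runs 10 times faster on the GPU.
f, a = 0.8, 10.0
s_max = round(max_speedup(f), 2)
s = round(speedup(f, a), 2)
E = round(efficiency(f, a), 2)
print(s_max, s, E)
```

The remaining 20% of the run time stays on the CPU, which is why even an arbitrarily fast GPU solver cannot push the overall speedup beyond 5 in this example.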
• CFD solver: OpenFOAM® 2.0.1/2.1.0
• Fair comparison:
– Best linear solver on CPU vs best linear solver on GPU
  • Krylov: preconditioned Conjugate Gradient method
  • Multigrid method
– Needs considerable tuning of solver parameters for both CPU and GPU solvers (multigrid, SIMPLE¹ algorithm, …)
– Same convergence criterion: specified tolerance of residual
• Hardware configuration: Tyan board with
– 2 CPUs: Intel Xeon X5650 @ 2.67 GHz
– 8 GPUs: Tesla 2070 (6 GB)
Setup
Example results
Slide 11 B. Landmann
1. Semi-Implicit Method for Pressure-Linked Equations
• Generic car shape model
• Incompressible flow
– simpleFoam solver
  • SIMPLE¹ method
– Pressure-velocity coupling
– Poisson equation for pressure: linear system solved by Culises
– k-ω SST turbulence model
– 2 computational grids
  • 3 million grid cells (sequential runs)
  • 22 million grid cells (parallel runs)
Automotive: DrivAER
Example results
Slide 12 B. Landmann
DrivAER geometry
Solver control (OpenFOAM®) via config files

CPU solver:
solvers
{
    p
    {
        solver          PCG;
        preconditioner  DIC;
        tolerance       1e-6;
        ...
    }
}

GPU solver via Culises (PCG replaced by PCGGPU):
solvers
{
    p
    {
        solver          PCGGPU;
        preconditioner  AMG;
        tolerance       1e-6;
        ...
    }
}
1. Semi-Implicit Method for Pressure-Linked Equations
• Single CPU vs single CPU+GPU
– Converged solution (4000 timesteps)
– Validation: comparison of results
  • DICPCG on CPU
  • AMGPCG on GPU
• Memory requirement
– AMGPCG: 40% of 6 GB; 1M cells require 0.80 GB → Tesla 2070: 7.5M cells
– DiagonalPCG: 13% of 6 GB; 1M cells require 0.26 GB → Tesla 2070: 23M cells
DrivAER 3M grid cells
Example results
Slide 13 B. Landmann
[Figure panels: single CPU (DICPCG); single CPU+GPU (AMGPCG)]
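The memory figures on this slide follow from simple arithmetic, sketched below. The sketch assumes that GPU memory use grows linearly with the number of grid cells, which is what the slide's per-million-cell numbers imply; the function names are hypothetical.

```python
GPU_MEMORY_GB = 6.0  # Tesla 2070, as stated on the slide
GRID_CELLS_M = 3.0   # DrivAER grid with 3 million cells


def gb_per_million_cells(memory_fraction):
    """Memory per 1M cells, from the measured fraction of GPU memory in use."""
    return memory_fraction * GPU_MEMORY_GB / GRID_CELLS_M


def max_cells_millions(memory_fraction):
    """Largest grid (in millions of cells) that fits on one GPU, assuming linear scaling."""
    return GPU_MEMORY_GB / gb_per_million_cells(memory_fraction)


# Measured memory fractions from the slide
capacity = {}
for solver, fraction in (("AMGPCG", 0.40), ("DiagonalPCG", 0.13)):
    per_million = gb_per_million_cells(fraction)
    capacity[solver] = (round(per_million, 2), round(max_cells_millions(fraction), 1))
    print(solver, capacity[solver])
```

The multigrid preconditioner stores a hierarchy of coarser matrices in addition to the system matrix, which is consistent with its roughly threefold memory footprint compared with the diagonal preconditioner.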
• Speedup with single GPU
DrivAER 3M grid cells
Example results
Slide 14 B. Landmann
Solver CPU | Solver GPU | Fraction f | Speedup s | Theoretical maximum speedup s_max | GPU acceleration of linear solver a | Efficiency E
GAMG¹ | AMGPCG | 0.55 | 1.56 | 2.22 | 3.36 | 68%
DICPCG | DiagonalPCG | 0.78 | 2.7 | 4.55 | 5.8 | 60%
DiagonalPCG | DiagonalPCG | 0.87 | 4.9 | 7.7 | 11.6 | 64%
with $s = \frac{1}{1 - f + \frac{f}{a}}$, $s_{max} = \frac{1}{1 - f}$ and $E = \frac{s}{s_{max}}$
1. GAMG: Generalized geometric-algebraic multigrid solver; geometric agglomeration based on grid face areas
[Plot: simulation time and scaling versus number of CPUs/GPUs (0 to 7); curves: total time, time linear solver, scaling total, scaling linear solver]
• Performance with multiple GPUs
• Strong scaling: multiple CPUs+GPUs (1-1 linkage)
– Scaling of total code versus # of CPUs and # of GPUs
– Scaling of linear solver versus # of CPUs and # of GPUs
DrivAER 3M grid cells
Example results
Slide 15 B. Landmann
AMGPCG solver
# of CPUs = # of GPUs
• Speedup by adding multiple GPUs (1-1 linkage)
DrivAER 3M grid
Example results
Slide 16 B. Landmann
Speedup total s:
Solver CPU vs Solver GPU | 1 CPU + 1 GPU | 2 CPUs + 2 GPUs | 4 CPUs + 4 GPUs | 6 CPUs + 6 GPUs
GAMG vs AMGPCG | 1.56 | 1.64 | 1.29 | 1.27
DICPCG vs DiagonalPCG | 2.7 | 1.49 | 1.20 | 1.45
DiagonalPCG vs DiagonalPCG | 4.9 | 2.84 | 1.79 | 2.03
Speedup linear solver a:
Solver CPU vs Solver GPU | 1 CPU + 1 GPU | 2 CPUs + 2 GPUs | 4 CPUs + 4 GPUs | 6 CPUs + 6 GPUs
GAMG vs AMGPCG | 3.36 | 3.06 | 2.38 | 2.13
DICPCG vs DiagonalPCG | 5.8 | 1.95 | 1.46 | 1.84
DiagonalPCG vs DiagonalPCG | 11.6 | 4.14 | 2.39 | 2.80
Example: computation is 2.84 times faster when running on 2 GPUs + 2 CPUs than running on 2 CPUs only
[Plot: simulation time and scaling for 3, 4, 6 and 8 CPUs/GPUs; curves: total time, time linear solver, total time CPU only, time linear solver CPU only, scaling total, scaling linear solver, scaling total CPU only, scaling linear solver CPU only]
• Performance with multiple GPUs; for memory reasons a minimum of 3 GPUs is needed (GPU memory usage ≈90%)
DrivAER 22M grid cells
Example results
Slide 17 B. Landmann
GAMG on CPUs only (dashed) AMGPCG on CPUs+GPUs (solid)
# of CPUs = # of GPUs
• Speedup by adding multiple GPUs (1-1 linkage): GAMG solver vs AMGPCG solver
• Utilization not optimal
→ Further optimization under development: n-1 linkage between CPU-GPU
DrivAER 22M grid cells
Example results
Slide 18 B. Landmann
# of CPUs + # of GPUs added | 3 CPUs + 3 GPUs | 4 CPUs + 4 GPUs | 6 CPUs + 6 GPUs | 8 CPUs + 8 GPUs
Speedup s | 1.56 | 1.58 | 1.54 | 1.42
Speedup linear solver a | 3.4 | 2.81 | 2.91 | 2.33
Fraction f | 0.60 | 0.59 | 0.57 | 0.50
Theoretical max speedup s_max | 2.50 | 2.43 | 2.33 | 2.00
Efficiency E | 62% | 65% | 66% | 71%
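The last two rows of these results follow from the first and third via s_max = 1/(1 − f) and E = s/s_max. The short sketch below recomputes them from the tabulated speedup s and fraction f; variable and function names are illustrative. Because f is only given to two decimals, a recomputed s_max can differ from the tabulated value in the last digit.

```python
# (number of CPU+GPU pairs, measured total speedup s, fraction f) as tabulated above
rows = [(3, 1.56, 0.60), (4, 1.58, 0.59), (6, 1.54, 0.57), (8, 1.42, 0.50)]


def max_speedup(f):
    """Theoretical maximum speedup for a ported fraction f."""
    return 1.0 / (1.0 - f)


def efficiency(s, f):
    """Efficiency E = s / s_max."""
    return s / max_speedup(f)


results = [(n, max_speedup(f), efficiency(s, f)) for n, s, f in rows]
for n, s_max, e in results:
    print(f"{n} CPUs + {n} GPUs: s_max = {s_max:.2f}, E = {e:.0%}")

efficiency_percent = [round(e * 100) for _, _, e in results]
```

Note that the efficiency rises with the number of devices only because f, and with it s_max, drops; the measured speedup s itself stays roughly constant and then declines at 8 CPUs + 8 GPUs.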
• LTSinterFoam solver
– Steady, with use of local time stepping method
– Volume of fluid (VoF) method
– Pressure solver: linear system → Culises
• 4M grid cells
Multiphase flow: ship hull
Example results
Slide 19 B. Landmann
Solver CPU | Solver GPU | Fraction f | Speedup s | Theoretical maximum speedup s_max | GPU acceleration of linear solver a | Efficiency E
DICPCG | DiagonalPCG | 0.43 | 1.54 | 1.75 | 4.91 | 88%
DiagonalPCG | DiagonalPCG | 0.55 | 2.12 | 2.22 | 8.66 | 95%
• buoyantPimpleFoam solver
– Unsteady PISO¹ method
– Pressure solver: linear system → Culises
• 4M grid cells
Heat transfer: heated room
Example results
Slide 20 B. Landmann
Solver CPU | Solver GPU | Fraction f | Speedup s | Theoretical maximum speedup s_max | GPU acceleration of linear solver a | Efficiency E
DICPCG | DiagonalPCG | 0.72 | 2.45 | 3.57 | 6.11 | 69%
DiagonalPCG | DiagonalPCG | 0.80 | 3.59 | 5.00 | 9.90 | 72%
1. Pressure-Implicit with Splitting of Operators
• pisoFoam solver
– Unsteady
– Pressure solver: linear system → Culises
– 500K grid cells
Process industry: flow molding
Example results
Slide 21 B. Landmann
Solver CPU | Solver GPU | Fraction f | Speedup s | Theoretical maximum speedup s_max | GPU acceleration of linear solver a | Efficiency E
DICPCG | DiagonalPCG | 0.84 | 2.65 | 6.25 | 3.6 | 42%
DiagonalPCG | DiagonalPCG | 0.94 | 6.9 | 16.7 | 10.4 | 42%
• interFoam solver
– Unsteady
– VoF method
– Pressure solver: linear system → Culises
• 500k grid cells
Pharmaceutical: generic bioreactor
Example results
Slide 22 B. Landmann
liquid surface
shaking device (off-centered spindle)
Solver CPU | Solver GPU | Fraction f | Speedup s | Theoretical maximum speedup s_max | GPU acceleration of linear solver a | Efficiency E
GAMG | AMGPCG | 0.53 | 1.44 | 2.12 | 2.59 | 68%
DiagonalPCG | DiagonalPCG | 0.81 | 3.00 | 5.26 | 5.94 | 57%
• Speedup categorized by application
Summary
Slide 23 B. Landmann
Application | Speedup | Acceleration | Efficiency
automotive | 1.6 | 3.4 | 65%
multiphase | 1.9 | 6.8 | 91%
heat transfer | 3 | 8 | 70%
pharmaceutics | 2.22 | 4.27 | 63%
process industry | 4.7 | 7 | 42%
Reference: OpenFOAM® basic = 1 for all applications
obtained from (averaged) single GPU test cases
• Stand-alone multigrid solver
• Multi-GPU usage and scalability
– Optimized load balancing via n-1 linkage between CPU-GPU
– Optimized data exchange via peer-to-peer (PCIe 2.0/3.0) transfers
Under development
Future Culises features
Slide 24 B. Landmann
Slide 25 B. Landmann
Questions?