3D ADI Method for Fluid Simulation on Multiple GPUs
Nikolai Sakharnykh, NVIDIA
Nikolay Markovskiy, NVIDIA
Introduction
Fluid simulation using direct numerical methods
— Gives the most accurate result
— Requires lots of memory and computational power
GPUs are very suitable for direct methods
— Have great instruction throughput and high memory bandwidth
How will it scale on multiple GPUs?
cmc-fluid-solver
Open source project on Google Code
— Started at CMC faculty of MSU, Russia
— CPU: OpenMP, GPU: CUDA
3D fluid simulation using ADI solver
Key people:
— MSU: Vilen Paskonov, Sergey Berezin
— NVIDIA: Nikolay Sakharnykh, Nikolay Markovskiy
Outline
Fluid Simulation in 3D domain
— Problem statement, applications
— ADI numerical method
— GPU implementation details, optimizations
— Performance analysis
Multi-GPU implementation
Problem Statement
Viscous incompressible fluid in a 3D domain
Arbitrary closed geometry for boundaries
Eulerian coordinates: velocity and temperature
Boundary conditions: free, injection, no-slip
Applications
Sea and ocean simulation
— Additional parameters: salinity, etc.
Low-speed gas flow
— Inside 3D channel
— Around objects
Definitions
Equation of state
— Describes the relation between pressure, density, and temperature
— Example (ideal gas): P = ρRT, where R is the gas constant for air
Notation: ρ – density, u – velocity, T – temperature, P – pressure
Governing equations
Continuity equation
— For incompressible fluids: ∇·u = 0
Navier-Stokes equations
— Dimensionless form, using the equation of state
— Re – Reynolds number (ratio of inertial to viscous forces)
Governing equations
Energy equation
— Dimensionless form, using the equation of state
— γ – heat capacity ratio
— Pr – Prandtl number
— Φ – dissipative function
ADI numerical method
[Figure: the 3D operator is split into three 1D sweeps — along X (Y, Z fixed), along Y (X, Z fixed), and along Z (X, Y fixed).]
ADI numerical method
Benefits
— Does not impose hard restrictions on the time step
— Domain decomposition – each step can be well parallelized
Many applications — any linear 3D PDE:
— Computational Fluid Dynamics
— Computational Finance
ADI method – iterations
Use global iterations for the whole system of equations
Some equations are not linear:
— Use local iterations to approximate the non-linear term
[Flow: previous time step → solve X-dir equations → solve Y-dir equations → solve Z-dir equations → update all variables → next time step; the sweep loop is repeated as global iterations.]
Discretization
Use regular grid, implicit finite difference scheme
— Second order in space
— First order in time
Leads to a tridiagonal system along each grid line in the sweep direction (see the sketch below)
— Independent system for each fixed pair (j, k)
Need to solve lots of tridiagonal systems
Sizes of systems may vary across the grid
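A sketch of the resulting system (generic notation, not the exact coefficients of the scheme): each interior point along the X sweep contributes one row
\[ a_i\,u^{n+1}_{i-1,j,k} + b_i\,u^{n+1}_{i,j,k} + c_i\,u^{n+1}_{i+1,j,k} = d_i, \qquad i = 1,\dots,N_x-2,\ (j,k)\ \text{fixed}. \]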
Tridiagonal systems
[Figure: a grid line crossing the geometry; cells are classified as outside, inside, or boundary, and each run of inside cells bounded by boundary cells forms an independent tridiagonal system (system 1, system 2, system 3).]
Implementation details
<for each direction X, Y, Z>
{
<for each local iteration>
{
<for each equation u, v, w, T>
{
build tridiagonal matrices and rhs
solve tridiagonal systems
}
update non-linear terms
}
}
GPU implementation
Store all data arrays entirely in GPU memory
— Reduce number of PCI-E transfers to minimum
— Map 3D arrays to linear memory
Main kernel
— Build matrix coefficients
— Solve tridiagonal systems
Index mapping: (X, Y, Z) → Z + Y * dimZ + X * dimY * dimZ
— Z is the fastest-changing dimension
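A small helper expressing this mapping (illustrative sketch; the project may inline the arithmetic directly):
__host__ __device__ inline int idx3d(int x, int y, int z, int dimY, int dimZ)
{
    // Z + Y * dimZ + X * dimY * dimZ: Z varies fastest, so neighbouring Z
    // indices map to neighbouring addresses
    return z + y * dimZ + x * dimY * dimZ;
}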
Building matrices
Input data:
— Previous/non-linear 3D layers
Each thread computes:
— Coefficients of a tridiagonal matrix
— Right-hand side vector
Use C++ templates for direction and equation
[Figure: a tridiagonal matrix with diagonals a, b, c and right-hand side vector d.]
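A minimal sketch of such a templated build kernel, assuming the layout above and a generic implicit diffusion term with r = ν·Δt/h²; the names and physics are illustrative only, not the project's actual kernels:
template <int DIR>  // 0 = X, 1 = Y, 2 = Z
__global__ void buildTridiagonal(const float* prev,          // previous time layer, XYZ layout
                                 float* a, float* b, float* c, float* d,
                                 int dimX, int dimY, int dimZ, float r)
{
    // one thread per tridiagonal system, identified by the two fixed coordinates
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    int len, stride, base;
    if (DIR == 0) {        // sweep along X: fixed (Y = i, Z = j)
        if (i >= dimY || j >= dimZ) return;
        len = dimX; stride = dimY * dimZ; base = j + i * dimZ;
    } else if (DIR == 1) { // sweep along Y: fixed (X = i, Z = j)
        if (i >= dimX || j >= dimZ) return;
        len = dimY; stride = dimZ;        base = j + i * dimY * dimZ;
    } else {               // sweep along Z: fixed (X = i, Y = j)
        if (i >= dimX || j >= dimY) return;
        len = dimZ; stride = 1;           base = j * dimZ + i * dimY * dimZ;
    }

    // interior rows; rows 0 and len-1 come from the boundary conditions
    for (int p = 1; p < len - 1; p++) {
        int idx = base + p * stride;
        a[idx] = -r;
        b[idx] = 1.0f + 2.0f * r;
        c[idx] = -r;
        d[idx] = prev[idx];
    }
}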
Building matrices – performance
Poor Z direction performance compared to X/Y
— Each thread walks its own contiguous memory region (one system along Z)
— Accesses within a warp are therefore uncoalesced, with lots of cache misses
[Chart: Build and Build + Solve total time (sec) per direction on Tesla C2050 (SP); the Z direction is much slower than X and Y.]
Build kernels profile:
Dir | Requests per warp | L1 global load hit % | IPC
X   | 2 – 3             | 25 – 45              | 1.4
Y   | 2 – 3             | 33 – 44              | 1.4
Z   | 32                | 0 – 15               | 0.2
Building matrices – optimization
Run Z phase in transposed XZY space
— Better locality for memory accesses
— Additional overhead on transpose
[Flow: X and Y local iterations run in the original XYZ space; the input arrays are then transposed to XZY space, the Z local iterations run there as Y-like sweeps, and the output arrays are transposed back to XYZ.]
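A minimal sketch of the XZY transpose (no shared-memory tiling; a production version would tile through shared memory to coalesce both reads and writes):
__global__ void transposeXYZtoXZY(const float* src, float* dst,
                                  int dimX, int dimY, int dimZ)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.z;
    if (x < dimX && y < dimY && z < dimZ)
        // source: Z + Y*dimZ + X*dimY*dimZ; destination swaps the roles of Y and Z
        dst[y + z * dimY + x * dimZ * dimY] = src[z + y * dimZ + x * dimY * dimZ];
}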
Building matrices - optimization
Tridiagonal solver time dominates over transpose
— Transpose takes a smaller share of the total time as the number of local iterations grows
[Chart: total time (sec) for X dir, Y dir, Z dir, and optimized Z dir (Build + Solve plus Transpose) on Tesla C2050 (SP); the optimized Z phase is about 2.5x faster.]
Build kernels profile (Z direction):
Z dir      | Requests per warp | L1 global load hit % | IPC
Original   | 32                | 0 – 15               | 0.2
Transposed | 2 – 3             | 30 – 38              | 1.3
Solving tridiagonal systems
Number of tridiagonal systems ~ grid size squared
Sweep (Thomas) algorithm is the most efficient in this case
— 1 thread solves 1 system
// forward sweep: eliminate the sub-diagonal (element 0 is assumed to be
// pre-filled from the boundary condition)
for( int p = 1; p < end; p++ ) {
    // .. compute tridiagonal coefficients a_val, b_val, c_val, d_val ..
    get(c,p) = c_val / (b_val - a_val * get(c,p-1));
    get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
}
// backward substitution (x at index end is assumed to hold the boundary value)
for( int i = end-1; i >= 0; i-- )
    get(x,i) = get(d,i) - get(c,i) * get(x,i+1);
Solving tridiagonal systems
Matrix layout is crucial for performance
Matrices for the X and Y directions are interleaved by default
Z matrices are interleaved as well when working in the transposed space
Interleaved layout: a0 a0 a0 a0 a1 a1 a1 a1 …
— Sweep friendly: threads 1, 2, 3, … solving neighbouring systems read consecutive addresses at each step
— Similar to the ELLPACK format for sparse matrices
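A hypothetical accessor matching this layout and the get() calls in the sweep code above (the project's actual macro may differ): coefficient row p of system sys lives at p * numSystems + sys, so at each step the threads of a warp touch consecutive addresses.
// numSystems and sys are assumed to be visible in the enclosing scope
#define get(arr, p) ((arr)[(size_t)(p) * numSystems + sys])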
Solving tridiagonal systems
L1/L2 effect on performance
— Using 48K L1 instead of 16K gives 10-15% speed-up
— Turning L1 off reduces performance by 10%
— Caches really help with misaligned accesses and spatial reuse
Occupancy >= 50%
— Running 128 threads per block
— 26-42 registers per thread (different for u, v, w, T)
— No shared memory
Performance benchmark
CPU configuration:
— Intel Core i7-3930K CPU @ 3.2 GHz, 6 cores / 12 threads
— Use OpenMP for CPU parallelization
Mostly memory bandwidth bound
Some parts achieve ~4x speed-up vs. 1 core
GPU configuration:
— NVIDIA Tesla C2070
Test cases
Box Pipe
Simple geometry
Systems of the same size
Need to compute at all points of the rectangular grid
[Figure: rectangular channel of cross-section 1 × 1 and length L.]
Test cases
White Sea
Complex geometry
Large variation in system sizes
Need to compute only inside the sea area
Performance results – Box Pipe
Grid 128x128x128
[Charts: throughput (segments/ms) for Solve X, Solve Y, Solve Z, and Total, CPU vs. GPU, in single and double precision. GPU speed-up over the CPU: 9.3x (single), 8.4x (double).]
Performance results – White Sea
Grid 256x192x160
[Charts: throughput (segments/ms) for Solve X, Solve Y, Solve Z, and Total, CPU vs. GPU, in single and double precision. GPU speed-up over the CPU: 10.3x (single), 9.5x (double).]
Outline
Fluid Simulation in 3D domain
Multi-GPU implementation
— General splitting algorithm
— Running computations using CUDA
— Benchmarking and performance analysis
— Improving weak scaling
Multi-GPU motivation
Limited available amount of memory
— 3D arrays: grid, temporary arrays, matrices
— Max grid size that fits into a Tesla M2050 is ~224³
Distribute the computations between multiple GPUs and
multiple nodes
— Can compute large grids
— Speed-up computations
Main Idea of mGPU
Systems along Y/Z are solved independently in parallel on each GPU
— No data transfer
Along X, data must be synchronized
[Figure: the grid is split along X across GPU 0, GPU 1, GPU 2; the alternating directions X, Y, Z are computed in turn.]
CUDA - parallelization
Split the grid along X (the longest stride)
Z + Y * dimZ + X * dimY * dimZ
Launch kernels on several GPUs from one host thread
Data transfer
— Async P2P through PCI-E (cudaMemcpyPeerAsync)
for (int i = 0; i < numDev; i++)
{
    cudaSetDevice(i);                 // switch device
    kernel<<<…>>>(devArray[i], ..);   // computation; launches are asynchronous, so all GPUs run concurrently
}
CUDA 4.x
Synchronization of Nonlinear Layer
• High aggregate throughput on an 8-GPU system
• Communication impact is not significant
// send the right boundary layer of GPU i to the left halo of GPU i+1
for (int i = 0; i < numDev-1; i++)
    cudaMemcpyPeerAsync(dHaloLeft[i+1], i+1, dDataRight[i], i, num_bytes, devStream[i]);
// might need multi-device synchronization here
// send the left boundary layer of GPU i to the right halo of GPU i-1
for (int i = 1; i < numDev; i++)
    cudaMemcpyPeerAsync(dHaloRight[i-1], i-1, dDataLeft[i], i, num_bytes, devStream[i]);
Solve X (tridiagonal solver)
[Figure: X segments distributed across GPU 0, GPU 1, GPU 2; segments are classified as bound, partially bound, or unbound, with halo layers at GPU boundaries.]
Solve X (tridiagonal solver)
• Process bound segments without intercommunication
• Interleave segments for better memory access – one segment per thread
• Align to the left
• Gauss elimination
• Communicate during the forward and backward sweeps
Solve X: split the grid (“long X”)
• Array[i*dimz*dimy+…]
• Allocation of layers across GPUs
• 3D segment analysis
Forward sweep along X: the sweep moves slab by slab, so only one GPU is active at a time while the others wait.
Back sweep along X: the active GPU moves through the slabs in reverse order.
Result: no speedup along X — the Gauss elimination along the split direction is sequential.
Benchmarks
Multiple GPU: 8 Tesla M2050 with P2P
Multiple Nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each
Sample tests: Box Pipe, White Sea
Results: 8 GPU, 1 MPI node
[Charts: total throughput (millions of points per second) vs. number of GPUs (1, 2, 4, 8) for White Sea and Box Pipe; Tesla M2050, grid 224³. Observed scaling factors: x4.5 / x1.4 (Box Pipe) and x2.9 / x1.35 (White Sea).]
1 GPU Efficiency
[Charts: single-GPU throughput (points/ms) vs. grid size for Box Pipe and White Sea, Tesla M2090.]
Estimate the amount of work per GPU in the 8-GPU system using a single GPU: 256³ / 8 = 128³ points per GPU.
— Box Pipe: a 128³ sub-grid gives enough work for a single GPU.
— White Sea: the computed region takes only about 5% of the grid volume, so a 128³ sub-grid is not enough.
Results: 1 GPU, 4 MPI nodes
[Charts: total throughput (millions of points per second) vs. number of MPI nodes (1, 2, 4), one Tesla M2090 per node, for White Sea and Box Pipe; observed scaling factors: x2.8 and x1.2.]
Load Balancing
[Chart: number of tridiagonal segments (Y(x) + Z(x) + X(x)dx) as a function of the X coordinate, White Sea grid.]
X splitting criteria:
— Equal volumes
— Equal number of segments (see the sketch below)
Performance benefit observed: up to 15.5% (Tesla M2090)
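A hypothetical host-side sketch of the “equal number of segments” criterion: given the segment count in every X slab, choose cut positions so that each GPU gets roughly the same total number of segments.
#include <vector>

std::vector<int> splitBySegments(const std::vector<long long>& segsPerSlab, int numDev)
{
    long long total = 0;
    for (long long s : segsPerSlab) total += s;

    std::vector<int> cuts;            // cuts[d-1] = first X slab owned by device d
    long long acc = 0;
    int nextDev = 1;
    for (int x = 0; x < (int)segsPerSlab.size() && nextDev < numDev; ++x) {
        acc += segsPerSlab[x];
        if (acc * numDev >= total * nextDev) {   // crossed the next 1/numDev quantile
            cuts.push_back(x + 1);
            ++nextDev;
        }
    }
    return cuts;
}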
Load Balancing. White Sea (288x320x320)
[Charts: per-GPU time (ms) for SweepX, SweepY, SweepZ, and Transpose on GPU 0–3 (Tesla M2090), for three splittings: Even X (t_total = 47.3 ms), Even Segments (t_total = 44.3 ms), Even Volumes (t_total = 44.4 ms).]
Analysis
All parts of the solver but one (Gauss elimination along X)
are fully parallel
Communication (using P2P + InfiniBand) is not a big issue for the given problem sizes
Bad weak scaling
— Use blocks to hide the latency of the X sweeps
Improved Solve X
[Figure: the grid on GPU0, GPU1, GPU2 is split into XY blocks B0 … B4 along the Z direction.]
Splitting the grid into XY blocks along the Z direction:
• Array[i*dimz*dimy+…], allocation of layers across GPUs, 3D segment analysis
• Segments sorting
• Sweep through all scalar fields at once
Block pipeline:
— Forward sweep along X, async halo send forward
— Move to the next block group
— Backward sweep along X, async halo send backward
Equal work per node!
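A rough host-side sketch of the block pipeline on one node (kernel and buffer names are hypothetical): while block b is swept on the compute stream, its halo is already being sent to the next GPU on a copy stream, so the neighbour can start its part of the sweep one block behind instead of waiting for the whole slab; the backward sweep follows the same pattern in reverse.
for (int b = 0; b < numBlocks; ++b) {
    forwardSweepX<<<grid, block, 0, computeStream>>>(coeffs[b]);   // sweep block b along X
    cudaEventRecord(blockDone[b], computeStream);
    cudaStreamWaitEvent(copyStream, blockDone[b], 0);              // copy waits only for block b
    cudaMemcpyPeerAsync(haloOnNext[b], nextDev, haloMine[b], myDev,
                        haloBytes, copyStream);                    // async halo send forward
}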
Algorithm
[Diagram: pipeline schedule over nodes node_0 … N_nodes and Z blocks block_0 … N_Z/N_blocks, with offsets i_block − i_node (forward) and 2(N_nodes − i_node − 1) (backward); each node receives halos X_{i_node−1} and X_{i_node+1} and sends X_{i_node}, using two CUDA streams (cudaStream1 for the forward sweep, cudaStream2 for the backward sweep).]
Improved Solve XY
[Figure: the same XY blocks B0 … B4 along Z, now used for the Y sweeps.]
Separate buffer for Y
sweeps
Block Y sweeps are
performed independently
in separate cudaStreams
Helps with data
transfer/compute overlap
Weak Scaling
[Chart: average time (ms) for Solve XYZ vs. number of GPUs (weak scaling), Box Pipe, grids 224³, 288³, 352³, 448³, Tesla M2050.]
Big Systems Limit
[Chart: average time (ms) for Solve XYZ vs. number of blocks (1–32), 8 Tesla M2050 GPUs, grid 768³.]
Consider one scalar field: no physics, more available RAM.
With larger grid sizes, the curve minimum shifts down and to the right.
Conclusions
The GPU outperforms a multi-core CPU by more than a 10x factor
GPU works well with complex input domains
Performance and scaling factors heavily depend on the input geometry and grid size
— Efficient work distribution methods are essential for performance
Using block splitting for ADI improves the scaling factor by hiding the sequential dependency of the sweep
Future work
Test on large scale systems
— Potentially on “Lomonosov” supercomputer at MSU
— GPU part with peak performance of 863 TFlops
Memory usage optimizations
Explore different tridiagonal approaches
Questions?
Thank You !