3D ADI Method for Fluid Simulation on Multiple GPUs Nikolai Sakharnykh, NVIDIA Nikolay Markovskiy, NVIDIA


Post on 12-Feb-2019


Page 1: 3D ADI Method for Fluid Simulation on Multiple GPUs · on-demand.gputechconf.com/gtc/2012/presentations/S0247-3D-ADI...

3D ADI Method for Fluid Simulation on Multiple GPUs

Nikolai Sakharnykh, NVIDIA

Nikolay Markovskiy, NVIDIA


Introduction

Fluid simulation using direct numerical methods

— Gives the most accurate result

— Requires lots of memory and computational power

GPUs are very suitable for direct methods

— Have great instruction throughput and high memory bandwidth

How will it scale on multiple GPUs?


cmc-fluid-solver

Open source project on Google Code

— Started at CMC faculty of MSU, Russia

— CPU: OpenMP, GPU: CUDA

3D fluid simulation using ADI solver

Key people:

— MSU: Vilen Paskonov, Sergey Berezin

— NVIDIA: Nikolai Sakharnykh, Nikolay Markovskiy


Outline

Fluid Simulation in 3D domain

— Problem statement, applications

— ADI numerical method

— GPU implementation details, optimizations

— Performance analysis

Multi-GPU implementation


Problem Statement

Viscous incompressible fluid in 3D domain

Arbitrary closed geometry for boundaries

Eulerian coordinates: velocity and temperature

Boundary types shown in the figure: free, injection, no-slip


Applications

Sea and ocean simulation

— Additional parameters: salinity, etc.

Low-speed gas flow

— Inside 3D channel

— Around objects


Definitions

Density $\rho$

Velocity $\mathbf{u} = (u, v, w)$

Temperature $T$

Pressure $p$

Equation of state

— Describes the relation between $p$, $\rho$ and $T$

— Example: $p = \rho R T$, where $R$ is the gas constant for air


Governing equations

Continuity equation

— For incompressible fluids: $\nabla \cdot \mathbf{u} = 0$

Navier-Stokes equations:

— Dimensionless form, use equation of state

— $Re$ – Reynolds number (= inertia/viscosity ratio)


Governing equations

Energy equation:

— Dimensionless form, use equation of state

— $\gamma$ – heat capacity ratio

— $Pr$ – Prandtl number

— $\Phi$ – dissipative function


ADI numerical method

Figure: the three sweeps of one ADI step. X sweep (fixed Y, Z), Y sweep (fixed X, Z), Z sweep (fixed X, Y)


ADI numerical method

Benefits

— No hard restriction on the time step

— Domain decomposition – each step can be well parallelized

Many applications

— Computational Fluid Dynamics

— Computational Finance

Linear 3D PDE


ADI method – iterations

Use global iterations for the whole system of equations

Some equations are not linear:

— Use local iterations to approximate the non-linear term

Figure: flow of one time step. Previous time step → solve X-dir equations → solve Y-dir equations → solve Z-dir equations → update all variables → next time step; the solve and update stages are repeated as global iterations


Discretization

Use regular grid, implicit finite difference scheme

— Second order in space

— First order in time

Leads to a tridiagonal system for the unknowns along each grid line

— Independent system for each fixed pair (j, k)
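To see why such a scheme gives a tridiagonal system, consider a model problem that is not taken from the slides: 1D diffusion $\partial u / \partial t = \nu \, \partial^2 u / \partial x^2$, discretized with backward Euler in time (first order) and central differences in space (second order). The symbols $\nu$, $\Delta t$, $\Delta x$ and $r$ are assumptions made for this illustration only.

```latex
\frac{u_i^{n+1} - u_i^{n}}{\Delta t}
  = \nu \, \frac{u_{i-1}^{n+1} - 2 u_i^{n+1} + u_{i+1}^{n+1}}{\Delta x^2}
\quad\Longrightarrow\quad
-r \, u_{i-1}^{n+1} + (1 + 2r) \, u_i^{n+1} - r \, u_{i+1}^{n+1} = u_i^{n},
\qquad r = \frac{\nu \, \Delta t}{\Delta x^2}
```

Each unknown couples only to its two neighbors along the line, so the coefficients are $a_i = -r$, $b_i = 1 + 2r$, $c_i = -r$ with right-hand side $d_i = u_i^{n}$, and every grid line with fixed (j, k) yields its own independent system.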


Tridiagonal systems

Need to solve lots of tridiagonal systems

Sizes of systems may vary across the grid

Figure: a grid line with outside, inside and boundary cells, split into system 1, system 2 and system 3


Implementation details

<for each direction X, Y, Z>
{
    <for each local iteration>
    {
        <for each equation u, v, w, T>
        {
            build tridiagonal matrices and rhs
            solve tridiagonal systems
        }
        update non-linear terms
    }
}


GPU implementation

Store all data arrays entirely in GPU memory

— Reduce number of PCI-E transfers to minimum

— Map 3D arrays to linear memory

Main kernel

— Build matrix coefficients

— Solve tridiagonal systems

(X, Y, Z) → Z + Y * dimZ + X * dimY * dimZ (Z is the fastest-changing dimension)
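The mapping above can be sketched as a small host-side C++ helper. The function names are hypothetical and not taken from cmc-fluid-solver; only the index formula comes from the slide.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: flat offset of cell (x, y, z), with Z changing fastest.
inline std::size_t linearIndex(std::size_t x, std::size_t y, std::size_t z,
                               std::size_t dimY, std::size_t dimZ) {
    return z + y * dimZ + x * dimY * dimZ;
}

// Distance in elements between neighboring cells along each axis.
inline std::size_t strideX(std::size_t dimY, std::size_t dimZ) { return dimY * dimZ; }
inline std::size_t strideY(std::size_t dimZ) { return dimZ; }
inline std::size_t strideZ() { return 1; }
```

With this layout a step along X is the longest stride and a step along Z is a single element.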


Building matrices

Input data:

— Previous/non-linear 3D layers

Each thread computes:

— Coefficients of a tridiagonal matrix

— Right-hand side vector

Use C++ templates for direction and equation

Figure labels: matrix diagonals a, b, c and right-hand side d


Building matrices – performance

Poor Z direction performance compared to X/Y

— Each thread walks its own contiguous memory region, so neighboring threads are dimZ elements apart

— Memory access is uncoalesced, lots of cache misses

Figure (Total time): time in sec (axis 0.0 to 2.0) on Tesla C2050 (SP) for Build and Build + Solve, in X, Y and Z directions

Build kernels:

Dir   Requests per warp   L1 global load hit %   IPC
X     2 – 3               25 – 45                1.4
Y     2 – 3               33 – 44                1.4
Z     32                  0 – 15                 0.2


Building matrices – optimization

Run Z phase in transposed XZY space

— Better locality for memory accesses

— Additional overhead on transpose

Figure: X local iterations and Y local iterations run in XYZ layout; the input arrays are then transposed to XZY, the Z phase runs there as Y local iterations, and the output arrays are transposed back to XYZ
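A minimal host-side sketch of the XYZ to XZY transpose, assuming the Z-fastest layout used in the deck. It is a plain C++ loop for illustration, not the GPU transpose kernel from the talk, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Copies src (XYZ layout, Z fastest) into a new array in XZY layout (Y fastest).
std::vector<float> transposeXYZtoXZY(const std::vector<float>& src,
                                     std::size_t dimX, std::size_t dimY,
                                     std::size_t dimZ) {
    std::vector<float> dst(src.size());
    for (std::size_t x = 0; x < dimX; ++x)
        for (std::size_t y = 0; y < dimY; ++y)
            for (std::size_t z = 0; z < dimZ; ++z)
                dst[y + z * dimY + x * dimZ * dimY] =
                    src[z + y * dimZ + x * dimY * dimZ];
    return dst;
}
```

Calling the same function on the result with dimY and dimZ swapped restores the original layout, which is what the output transpose does.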


Building matrices - optimization

Tridiagonal solver time dominates over transpose

— Transpose takes a smaller share of the total time with more local iterations

Figure (Total time): time in sec (axis 0.0 to 2.0) on Tesla C2050 (SP) for X dir, Y dir, Z dir and Z dir OPT, split into Transpose and Build + Solve; annotated speed-up: 2.5x

Build kernels:

Z dir        Requests per warp   L1 global load hit %   IPC
Original     32                  0 – 15                 0.2
Transposed   2 – 3               30 – 38                1.3


Solving tridiagonal systems

Number of tridiagonal systems ~ grid size squared

Sweep algorithm is the most efficient in this case

— 1 thread solves 1 system

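// forward sweep: eliminate the sub-diagonal (row 0 is assumed to be set up before the loop)
for( int p = 1; p < end; p++ ) {
    // .. compute tridiagonal coefficients a_val, b_val, c_val, d_val ..
    get(c,p) = c_val / (b_val - a_val * get(c,p-1));
    get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
}
// back substitution: get(x,end) is assumed to hold the known boundary value
for( int i = end-1; i >= 0; i-- )
    get(x,i) = get(d,i) - get(c,i) * get(x, i+1);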
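For reference, the sweep (Thomas) algorithm can be written as a complete host-side routine. This is a sketch under stated assumptions, not the kernel from the talk: the function name is hypothetical, the system is stored in four plain vectors, and no pivoting is done, so the matrix is assumed to be diagonally dominant.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Solves a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i] for i = 0..n-1.
// a[0] and c[n-1] are not used. Assumes n >= 1 and no zero pivots.
std::vector<double> solveTridiagonal(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     const std::vector<double>& c,
                                     const std::vector<double>& d) {
    const std::size_t n = b.size();
    std::vector<double> cp(n), dp(n), x(n);
    cp[0] = c[0] / b[0];
    dp[0] = d[0] / b[0];
    // Forward sweep: eliminate the sub-diagonal.
    for (std::size_t p = 1; p < n; ++p) {
        const double denom = b[p] - a[p] * cp[p - 1];
        cp[p] = c[p] / denom;
        dp[p] = (d[p] - a[p] * dp[p - 1]) / denom;
    }
    // Back substitution.
    x[n - 1] = dp[n - 1];
    for (std::size_t i = n - 1; i-- > 0;)
        x[i] = dp[i] - cp[i] * x[i + 1];
    return x;
}
```

The work per system is linear in its length and strictly sequential, which is why one thread per system is a natural fit when the number of systems is large.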


Solving tridiagonal systems

Matrix layout is crucial for performance

X, Y directions matrices are interleaved by default

Z is interleaved as well when solved in transposed space

Interleaved layout (sweep-friendly): a0 a0 a0 a0 a1 a1 a1 a1, so that thread 1, thread 2, thread 3 read neighboring elements

— Similar to ELLPACK for sparse matrices
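The difference between the two layouts comes down to the index formula. The sketch below uses hypothetical helper names; `s` is the system (thread) number and `p` is the position inside the system.

```cpp
#include <cassert>
#include <cstddef>

// Interleaved layout: element p of every system is stored side by side,
// so threads s and s+1 touch adjacent addresses at each sweep step.
inline std::size_t interleavedIndex(std::size_t p, std::size_t s,
                                    std::size_t numSystems) {
    return p * numSystems + s;
}

// Sequential layout: each system is stored contiguously,
// so threads s and s+1 are systemSize elements apart.
inline std::size_t sequentialIndex(std::size_t p, std::size_t s,
                                   std::size_t systemSize) {
    return s * systemSize + p;
}
```

With four systems, `interleavedIndex` places a0 of systems 0 to 3 at offsets 0 to 3 and a1 at offsets 4 to 7, which matches the a0 a0 a0 a0 a1 a1 a1 a1 picture.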


Solving tridiagonal systems

L1/L2 effect on performance

— Using 48K L1 instead of 16K gives 10-15% speed-up

— Turning L1 off reduces performance by 10%

— Caches really help with misaligned accesses and spatial reuse

Occupancy >= 50%

— Running 128 threads per block

— 26-42 registers per thread (different for u, v, w, T)

— No shared memory
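The occupancy figure can be checked with a rough estimate. The sketch below assumes the per-multiprocessor limits of compute capability 2.0 (32768 registers, 1536 resident threads, 8 resident blocks) and deliberately ignores register allocation granularity and shared memory; the struct and function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM limits for compute capability 2.0 (Fermi).
struct SmLimits {
    int registers = 32768;
    int maxThreads = 1536;
    int maxBlocks = 8;
};

// Simplified occupancy estimate: resident threads / maximum resident threads.
double estimateOccupancy(int regsPerThread, int threadsPerBlock,
                         const SmLimits& sm = SmLimits()) {
    const int byRegisters = sm.registers / (regsPerThread * threadsPerBlock);
    const int byThreads = sm.maxThreads / threadsPerBlock;
    const int blocks = std::min({byRegisters, byThreads, sm.maxBlocks});
    return static_cast<double>(blocks * threadsPerBlock) / sm.maxThreads;
}
```

Under these assumptions 42 registers per thread with 128 threads per block allows 6 resident blocks, that is 768 of 1536 threads or 50%, while 26 registers hits the 8-block limit at about 67%.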


Performance benchmark

CPU configuration:

— Intel Core i7-3930K CPU @ 3.2 GHz, 6 cores (12 hardware threads)

— Use OpenMP for CPU parallelization

— Mostly memory bandwidth bound

— Some parts achieve ~4x speed-up vs 1 core

GPU configuration:

— NVIDIA Tesla C2070


Test cases

Box Pipe

Simple geometry

Systems of the same size

Need to compute at all points of the rectangular grid

Figure: box pipe with axes X, Y, Z and dimension labels 1, 1 and L


Test cases

White Sea

Complex geometry

Large variation in system sizes

Need to compute only inside the area

Figure: White Sea domain, axes X and Y


Performance results – Box Pipe

Grid 128x128x128

Figure: throughput in segments/ms (axis 0 to 2500) for Solve X, Solve Y, Solve Z and Total, CPU vs GPU, panels SINGLE and DOUBLE precision

Annotated GPU speed-ups: 9.3x, 8.4x


Performance results – White Sea

Grid 256x192x160

Figure: throughput in segments/ms (axis 0 to 4000) for Solve X, Solve Y, Solve Z and Total, CPU vs GPU, panels SINGLE and DOUBLE precision

Annotated GPU speed-ups: 10.3x, 9.5x


Outline

Fluid Simulation in 3D domain

Multi-GPU implementation

— General splitting algorithm

— Running computations using CUDA

— Benchmarking and performance analysis

— Improving weak scaling


Multi-GPU motivation

Limited available amount of memory

— 3D arrays: grid, temporary arrays, matrices

— Max size of grid that can fit into Tesla M2050 ~ 224^3

Distribute the computations between multiple GPUs and multiple nodes

— Can compute large grids

— Speed up computations


Main Idea of mGPU

Systems along Y/Z are solved independently in parallel on each GPU

— No data transfer

Along X data must be synchronized

Figure: grid split along X between GPU 0, GPU 1 and GPU 2; alternating directions are computed in the order X, Y, Z


CUDA - parallelization

Split the grid along X (the longest stride)

Z + Y * dimZ + X * dimY * dimZ

Launch kernels on several GPUs from one host thread (CUDA 4.x)

for (int i = 0; i < numDev; i++)
{
    cudaSetDevice(i);                  // Switch device
    kernel<<<…>>>(devArray[i], ..);    // Computation
}

Data transfer

— Async P2P through PCI-E (cudaMemcpyPeerAsync)
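Before kernels can be launched per device, each GPU needs its range of X layers. A minimal sketch of an even split is shown below; the `Slab` type and `splitAlongX` name are hypothetical, and the even split is an assumption for illustration rather than the partitioning used in the project.

```cpp
#include <cassert>
#include <vector>

// Half-open range [begin, end) of X layers owned by one device.
struct Slab {
    int begin;
    int end;
};

// Splits dimX layers between numDev devices as evenly as possible;
// the first (dimX % numDev) devices get one extra layer.
std::vector<Slab> splitAlongX(int dimX, int numDev) {
    std::vector<Slab> slabs;
    const int base = dimX / numDev;
    const int extra = dimX % numDev;
    int start = 0;
    for (int i = 0; i < numDev; ++i) {
        const int len = base + (i < extra ? 1 : 0);
        slabs.push_back({start, start + len});
        start += len;
    }
    return slabs;
}
```

Because X has the longest stride, each slab is one contiguous block of `(end - begin) * dimY * dimZ` elements, so it can be allocated and copied as a single array per device.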


Synchronization of Nonlinear Layer

• High aggregate throughput on 8 GPU system

• Communication impact is not significant

// send the right boundary layer of device i to the left halo of device i+1
for (int i = 0; i < numDev-1; i++)
    cudaMemcpyPeerAsync(dHaloLeft[i+1], i+1, dDataRight[i], i, num_bytes, devStream[i]);

// might need multidev synchronization here

// send the left boundary layer of device i to the right halo of device i-1
for (int i = 1; i < numDev; i++)
    cudaMemcpyPeerAsync(dHaloRight[i-1], i-1, dDataLeft[i], i, num_bytes, devStream[i]);
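The data movement of this exchange can be mimicked on the host to check which layer ends up in which halo. The sketch below replaces device buffers with `std::vector` and the peer copies with plain assignments; `DeviceSlab` and `exchangeHalos` are hypothetical names, and nothing here calls the CUDA runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host stand-in for the per-device buffers.
struct DeviceSlab {
    std::vector<float> data;       // own X layers, layerSize elements each
    std::vector<float> haloLeft;   // copy of the left neighbor's last layer
    std::vector<float> haloRight;  // copy of the right neighbor's first layer
};

void exchangeHalos(std::vector<DeviceSlab>& dev, std::size_t layerSize) {
    const std::ptrdiff_t len = static_cast<std::ptrdiff_t>(layerSize);
    // Right boundary layer of device i goes to the left halo of device i+1.
    for (std::size_t i = 0; i + 1 < dev.size(); ++i)
        dev[i + 1].haloLeft.assign(dev[i].data.end() - len, dev[i].data.end());
    // Left boundary layer of device i goes to the right halo of device i-1.
    for (std::size_t i = 1; i < dev.size(); ++i)
        dev[i - 1].haloRight.assign(dev[i].data.begin(), dev[i].data.begin() + len);
}
```

The first and last devices keep one empty halo each, matching the loop bounds `numDev-1` and `1` in the snippet above.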


Solve X (tridiagonal solver)

Figure: segments along X across the GPUs; legend: bound, partially bound, unbound, halo


Solve X (tridiagonal solver)

• Process bound segments without intercommunication

• Interleave segments for better memory access – one segment per thread

• Align to the left

• Gauss elimination

• Communicate: forward, backward
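A segment that crosses GPU boundaries can still be solved with the sweep algorithm if each GPU passes a small carry to its neighbor. The host-side sketch below illustrates the idea under stated assumptions: `Chunk`, `Carry` and the three functions are hypothetical names, each chunk stands for the rows held by one GPU, and the first row of the whole system has `a = 0` while the last row has `c = 0`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Chunk { std::vector<double> a, b, c, d, x; };  // rows owned by one GPU
struct Carry { double c; double d; };                 // modified c, d of the previous row

// Forward elimination inside one chunk; returns the carry for the right neighbor.
Carry forwardSweep(Chunk& ch, Carry in) {
    for (std::size_t p = 0; p < ch.b.size(); ++p) {
        const double denom = ch.b[p] - ch.a[p] * in.c;
        const double cNew = ch.c[p] / denom;
        const double dNew = (ch.d[p] - ch.a[p] * in.d) / denom;
        ch.c[p] = cNew;
        ch.d[p] = dNew;
        in = {cNew, dNew};
    }
    return in;
}

// Back substitution inside one chunk; returns its first unknown for the left neighbor.
double backwardSweep(Chunk& ch, double xNext) {
    ch.x.resize(ch.b.size());
    for (std::size_t i = ch.b.size(); i-- > 0;) {
        xNext = ch.d[i] - ch.c[i] * xNext;
        ch.x[i] = xNext;
    }
    return xNext;
}

void solveAcrossChunks(std::vector<Chunk>& chunks) {
    Carry carry{0.0, 0.0};
    for (Chunk& ch : chunks) carry = forwardSweep(ch, carry);   // first to last GPU
    double xNext = 0.0;
    for (std::size_t k = chunks.size(); k-- > 0;)               // last to first GPU
        xNext = backwardSweep(chunks[k], xNext);
}
```

Only two numbers per segment travel forward and one travels backward, but each chunk has to wait for its neighbor, so only one chunk is active at a time for a given segment.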


Solve X

Split the grid (“long X”)

• Array[i*dimz*dimy+…]

• Allocation of layers in mGPU

• 3D segment analysis


Forward sweep along X (the figure highlights the single active GPU)



Back sweep along X (the figure highlights the single active GPU)



Result: no speedup along X


Benchmarks

Multiple GPU: 8 Tesla M2050 with P2P

Multiple Nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each

Sample tests: Box Pipe, White Sea


Results: 8 GPU, 1 MPI node

Figure: total throughput in millions of points per sec vs number of GPUs (1, 2, 4, 8), panels White Sea (axis 0 to 35) and Box Pipe (axis 0 to 250); Tesla M2050, grid 224^3

Annotated scaling factors: x4.5, x1.4, x2.9, x1.35


1 GPU Efficiency

Figure: points/ms vs grid size (0 to 256) on Tesla M2090, panels Box Pipe (axis 0 to 80000) and White Sea (axis 0 to 20000)

Estimate amount of work per GPU in 8xGPU system using single GPU: 256^3/8 = 128^3

— Box Pipe – enough work for single GPU

— White Sea – takes about 5% of volume of the grid. Grid size of 128^3 is not enough.


Results: 1 GPU, 4 MPI nodes

Figure: total throughput in millions of points per sec vs number of MPI nodes (1, 2, 4), panels White Sea (axis 0 to 35) and Box Pipe (axis 0 to 200); Tesla M2090

Annotated scaling factors: x2.8, x1.2


Load Balancing

Figure: number of segments, Y(x) + Z(X) + X(x)dX, as a function of x (x axis 0 to 288, segments axis 0 to 1400); Tesla M2090

X splitting criteria:

— Equal volumes

— Equal number of segments

Performance benefit observed: up to 15.5%
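Either criterion can be expressed as splitting X so that every GPU receives roughly the same cumulative work, where the per-layer work is the cell count (equal volumes) or the segment count (equal segments). The greedy sketch below is an illustration under that assumption; `splitByWork` is a hypothetical name and not the project's load balancer.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// work[x] is the amount of work in X layer x (cells or segments).
// Returns numDev + 1 cut positions; device k owns layers [cuts[k], cuts[k+1]).
std::vector<int> splitByWork(const std::vector<long long>& work, int numDev) {
    const long long total = std::accumulate(work.begin(), work.end(), 0LL);
    const int n = static_cast<int>(work.size());
    std::vector<int> cuts(1, 0);
    long long acc = 0;
    for (int x = 0; x < n; ++x) {
        acc += work[x];
        // Close the current slab once it holds its share of the total work.
        while (static_cast<int>(cuts.size()) < numDev &&
               acc * numDev >= total * static_cast<long long>(cuts.size()))
            cuts.push_back(x + 1);
    }
    // Pad so that the last slab always ends at n.
    while (static_cast<int>(cuts.size()) <= numDev) cuts.push_back(n);
    return cuts;
}
```

For a per-layer work profile of 3, 3, 1, 1, 1, 1, 1, 1 and two devices, an even split by layer count would give 8 and 4 units of work, while this split gives 6 and 6. A single very heavy layer can leave a device with an empty slab, which a production balancer would need to handle.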


Load Balancing. White Sea (288x320x320)

Figure: time in ms (axis 0 to 20) of SweepX, SweepY, SweepZ and Transpose on GPU 0 to GPU 3 (Tesla M2090), panels Even X, Even Segments and Even Volumes

Annotated totals: ttotal = 47.3, ttotal = 44.3, ttotal = 44.4


Analysis

All parts of the solver but one (Gauss elimination along X) are fully parallel

Communication (using P2P + InfiniBand) is not a big issue for the given problem size

Poor weak scaling

Use blocks to hide latency for X sweeps


Improved Solve X

Split the grid (“long X”)

• Array[i*dimz*dimy+…]

• Allocation of layers in mGPU

• 3D segment analysis

Figure: grid split along X between GPU0, GPU1 and GPU2


Improved Solve X

Splitting the grid to XY blocks along Z direction (B0 to B4 in the figure, across GPU0, GPU1 and GPU2)

• Segments sorting

• Sweep through all scalar fields at once


Improved Solve X: sweep order

• Forward sweep along X, async halo send forward

• Move to the next block group

• Backward sweep along X, async halo send backward

Equal work per node!
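The benefit of block splitting can be estimated with a simple pipeline model. The sketch below is an idealized model, not a measurement: it assumes every (node, block) pair costs one time step, halo transfers are free, and a node may start a block only after its left neighbor has finished the same block. The function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns the step at which the last node finishes the last block
// of the forward sweep under the unit-cost model described above.
int simulateForwardSweep(int numNodes, int numBlocks) {
    std::vector<std::vector<int>> finish(numNodes, std::vector<int>(numBlocks, 0));
    for (int i = 0; i < numNodes; ++i) {
        for (int b = 0; b < numBlocks; ++b) {
            const int haloReady = (i > 0) ? finish[i - 1][b] : 0;  // from left neighbor
            const int selfReady = (b > 0) ? finish[i][b - 1] : 0;  // own previous block
            finish[i][b] = std::max(haloReady, selfReady) + 1;
        }
    }
    return finish[numNodes - 1][numBlocks - 1];
}

// Fraction of the sweep time during which a node is busy.
double pipelineEfficiency(int numNodes, int numBlocks) {
    return static_cast<double>(numBlocks) / simulateForwardSweep(numNodes, numBlocks);
}
```

With 3 nodes and a single block the sweep takes 3 steps and each node is busy a third of the time; with 5 blocks it takes 7 shorter steps and each node is busy 5/7 of the time. The model ignores per-block launch and transfer overhead, which in practice limits how far the block count can be raised.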


Algorithm

Figure: block schedule for node $i_{node}$, with nodes $node_0 \dots N_{nodes}$ along X and blocks $block_0 \dots i_{block}$ along Z. Labels in the figure: $2(N_{nodes} - i_{node} - 1)$, $i_{block} - i_{node}$, $N_Z$, $N_{blocks}$, cudaStream1, cudaStream2, Forward, Backward

Communication labels: receive $X_{i_{node}-1}$, receive $X_{i_{node}+1}$, send $X_{i_{node}}$


Improved Solve XY

Y blocks (B0 to B4 in the figure)

Separate buffer for Y sweeps

Block Y sweeps are performed independently in separate cudaStreams

Helps with data transfer/compute overlap


Weak Scaling

Figure: average time for Solve XYZ in ms (axis 100 to 400) vs number of GPUs (axis 0 to 10); Box Pipe, Tesla M2050

Grids: 224^3, 288^3, 352^3, 448^3


Big Systems Limit

Figure: average time for Solve XYZ in ms (axis 0 to 250) vs number of blocks (1, 2, 4, 8, 16, 32); 8 M2050 GPUs, grid 768^3

Consider one scalar field: no physics, more available RAM

With larger grid sizes, curve minimum shifts down/right


Conclusions

GPU outperforms multi-core CPU by roughly 10x (8.4x to 10.3x in the benchmarks above)

GPU works well with complex input domains

Performance and scaling factors heavily depend on input geometry and size of grid

— Efficient work distribution methods are essential for performance

Using block-splitting for ADI improves scaling factor by hiding dependency of sweep processing


Future work

Test on large scale systems

— Potentially on “Lomonosov” supercomputer at MSU

— GPU part with peak performance of 863 TFlops

Memory usage optimizations

Explore different tridiagonal approaches


Questions?

Thank You !
