3D ADI Method for Fluid Simulation on Multiple GPUs
Nikolai Sakharnykh, NVIDIA
Nikolay Markovskiy, NVIDIA
Introduction
Fluid simulation using direct numerical methods
— Gives the most accurate result
— Requires lots of memory and computational power
GPUs are very suitable for direct methods
— Have great instruction throughput and high memory bandwidth
How will it scale on multiple GPUs?
cmc-fluid-solver
Open source project on Google Code
— Started at CMC faculty of MSU, Russia
— CPU: OpenMP, GPU: CUDA
3D fluid simulation using ADI solver
Key people:
— MSU: Vilen Paskonov, Sergey Berezin
— NVIDIA: Nikolay Sakharnykh, Nikolay Markovskiy
Outline
Fluid Simulation in 3D domain
— Problem statement, applications
— ADI numerical method
— GPU implementation details, optimizations
— Performance analysis
Multi-GPU implementation
Problem Statement
Viscous incompressible fluid in a 3D domain
Arbitrary closed geometry for boundaries
Eulerian coordinates: velocity and temperature
Boundary conditions: free, injection, no-slip
Applications
Sea and ocean simulation
— Additional parameters: salinity, etc.
Low-speed gas flow
— Inside 3D channel
— Around objects
Definitions
Equation of state
— Describes the relation between pressure, density, and temperature
— Example (ideal gas): P = ρRT, where R is the gas constant for air
Notation: ρ – density, u – velocity, T – temperature, P – pressure
Governing equations
Continuity equation
— For incompressible fluids: ∇·u = 0
Navier-Stokes equations
— Dimensionless form, using the equation of state
— Re – Reynolds number (ratio of inertial to viscous forces)
Governing equations
Energy equation
— Dimensionless form, using the equation of state
— γ – heat capacity ratio
— Pr – Prandtl number
— Φ – dissipative function
ADI numerical method
[Figure: the 3D operator is split into three 1D sweeps — along X (Y, Z fixed), along Y (X, Z fixed), and along Z (X, Y fixed).]
ADI numerical method
Benefits
— Does not impose hard restrictions on the time step
— Domain decomposition – each step can be well parallelized
Many applications — any linear 3D PDE:
— Computational Fluid Dynamics
— Computational Finance
ADI method – iterations
Use global iterations for the whole system of equations
Some equations are not linear:
— Use local iterations to approximate the non-linear term
[Flow: previous time step → solve X-dir equations → solve Y-dir equations → solve Z-dir equations → update all variables → next time step; the sweep loop is repeated as global iterations.]
Discretization
Use regular grid, implicit finite difference scheme
— Second order in space
— First order in time
Leads to a tridiagonal system along each grid line in the sweep direction (see the sketch below)
— Independent system for each fixed pair (j, k)
Need to solve lots of tridiagonal systems
Sizes of systems may vary across the grid
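A sketch of the resulting system (generic notation, not the exact coefficients of the scheme): each interior point along the X sweep contributes one row
\[ a_i\,u^{n+1}_{i-1,j,k} + b_i\,u^{n+1}_{i,j,k} + c_i\,u^{n+1}_{i+1,j,k} = d_i, \qquad i = 1,\dots,N_x-2,\ (j,k)\ \text{fixed}. \]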
Tridiagonal systems
[Figure: a grid line crossing the geometry; cells are classified as outside, inside, or boundary, and each run of inside cells bounded by boundary cells forms an independent tridiagonal system (system 1, system 2, system 3).]
Implementation details
<for each direction X, Y, Z>
{
<for each local iteration>
{
<for each equation u, v, w, T>
{
build tridiagonal matrices and rhs
solve tridiagonal systems
}
update non-linear terms
}
}
GPU implementation
Store all data arrays entirely in GPU memory
— Reduce number of PCI-E transfers to minimum
— Map 3D arrays to linear memory
Main kernel
— Build matrix coefficients
— Solve tridiagonal systems
Index mapping: (X, Y, Z) → Z + Y * dimZ + X * dimY * dimZ
— Z is the fastest-changing dimension
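A small helper expressing this mapping (illustrative sketch; the project may inline the arithmetic directly):
__host__ __device__ inline int idx3d(int x, int y, int z, int dimY, int dimZ)
{
    // Z + Y * dimZ + X * dimY * dimZ: Z varies fastest, so neighbouring Z
    // indices map to neighbouring addresses
    return z + y * dimZ + x * dimY * dimZ;
}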
Building matrices
Input data:
— Previous/non-linear 3D layers
Each thread computes:
— Coefficients of a tridiagonal matrix
— Right-hand side vector
Use C++ templates for direction and equation
[Figure: a tridiagonal matrix with diagonals a, b, c and right-hand side vector d.]
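A minimal sketch of such a templated build kernel, assuming the layout above and a generic implicit diffusion term with r = ν·Δt/h²; the names and physics are illustrative only, not the project's actual kernels:
template <int DIR>  // 0 = X, 1 = Y, 2 = Z
__global__ void buildTridiagonal(const float* prev,          // previous time layer, XYZ layout
                                 float* a, float* b, float* c, float* d,
                                 int dimX, int dimY, int dimZ, float r)
{
    // one thread per tridiagonal system, identified by the two fixed coordinates
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    int len, stride, base;
    if (DIR == 0) {        // sweep along X: fixed (Y = i, Z = j)
        if (i >= dimY || j >= dimZ) return;
        len = dimX; stride = dimY * dimZ; base = j + i * dimZ;
    } else if (DIR == 1) { // sweep along Y: fixed (X = i, Z = j)
        if (i >= dimX || j >= dimZ) return;
        len = dimY; stride = dimZ;        base = j + i * dimY * dimZ;
    } else {               // sweep along Z: fixed (X = i, Y = j)
        if (i >= dimX || j >= dimY) return;
        len = dimZ; stride = 1;           base = j * dimZ + i * dimY * dimZ;
    }

    // interior rows; rows 0 and len-1 come from the boundary conditions
    for (int p = 1; p < len - 1; p++) {
        int idx = base + p * stride;
        a[idx] = -r;
        b[idx] = 1.0f + 2.0f * r;
        c[idx] = -r;
        d[idx] = prev[idx];
    }
}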
Building matrices – performance
Poor Z direction performance compared to X/Y
— Each thread walks its own contiguous memory region (one system along Z)
— Accesses within a warp are therefore uncoalesced, with lots of cache misses
[Chart: Build and Build + Solve total time (sec) per direction on Tesla C2050 (SP); the Z direction is much slower than X and Y.]
Build kernels profile:
Dir | Requests per warp | L1 global load hit % | IPC
X   | 2 – 3             | 25 – 45              | 1.4
Y   | 2 – 3             | 33 – 44              | 1.4
Z   | 32                | 0 – 15               | 0.2
Building matrices – optimization
Run Z phase in transposed XZY space
— Better locality for memory accesses
— Additional overhead on transpose
[Flow: X and Y local iterations run in the original XYZ space; the input arrays are then transposed to XZY space, the Z local iterations run there as Y-like sweeps, and the output arrays are transposed back to XYZ.]
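A minimal sketch of the XZY transpose (no shared-memory tiling; a production version would tile through shared memory to coalesce both reads and writes):
__global__ void transposeXYZtoXZY(const float* src, float* dst,
                                  int dimX, int dimY, int dimZ)
{
    int z = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int x = blockIdx.z;
    if (x < dimX && y < dimY && z < dimZ)
        // source: Z + Y*dimZ + X*dimY*dimZ; destination swaps the roles of Y and Z
        dst[y + z * dimY + x * dimZ * dimY] = src[z + y * dimZ + x * dimY * dimZ];
}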
Building matrices - optimization
Tridiagonal solver time dominates over transpose
— Transpose takes a smaller share of the total time as the number of local iterations grows
[Chart: total time (sec) for X dir, Y dir, Z dir, and optimized Z dir (Build + Solve plus Transpose) on Tesla C2050 (SP); the optimized Z phase is about 2.5x faster.]
Build kernels profile (Z direction):
Z dir      | Requests per warp | L1 global load hit % | IPC
Original   | 32                | 0 – 15               | 0.2
Transposed | 2 – 3             | 30 – 38              | 1.3
Solving tridiagonal systems
Number of tridiagonal systems ~ grid size squared
Sweep (Thomas) algorithm is the most efficient in this case
— 1 thread solves 1 system
// forward sweep: eliminate the sub-diagonal (element 0 is assumed to be
// pre-filled from the boundary condition)
for( int p = 1; p < end; p++ ) {
    // .. compute tridiagonal coefficients a_val, b_val, c_val, d_val ..
    get(c,p) = c_val / (b_val - a_val * get(c,p-1));
    get(d,p) = (d_val - get(d,p-1) * a_val) / (b_val - a_val * get(c,p-1));
}
// backward substitution (x at index end is assumed to hold the boundary value)
for( int i = end-1; i >= 0; i-- )
    get(x,i) = get(d,i) - get(c,i) * get(x,i+1);
Solving tridiagonal systems
Matrix layout is crucial for performance
Matrices for the X and Y directions are interleaved by default
Z matrices are interleaved as well when working in the transposed space
Interleaved layout: a0 a0 a0 a0 a1 a1 a1 a1 …
— Sweep friendly: threads 1, 2, 3, … solving neighbouring systems read consecutive addresses at each step
— Similar to the ELLPACK format for sparse matrices
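A hypothetical accessor matching this layout and the get() calls in the sweep code above (the project's actual macro may differ): coefficient row p of system sys lives at p * numSystems + sys, so at each step the threads of a warp touch consecutive addresses.
// numSystems and sys are assumed to be visible in the enclosing scope
#define get(arr, p) ((arr)[(size_t)(p) * numSystems + sys])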
Solving tridiagonal systems
L1/L2 effect on performance
— Using 48K L1 instead of 16K gives 10-15% speed-up
— Turning L1 off reduces performance by 10%
— Caches really help with misaligned accesses and spatial reuse
Occupancy >= 50%
— Running 128 threads per block
— 26-42 registers per thread (different for u, v, w, T)
— No shared memory
Performance benchmark
CPU configuration:
— Intel Core i7-3930K CPU @ 3.2 GHz, 6 cores / 12 threads
— Use OpenMP for CPU parallelization
Mostly memory bandwidth bound
Some parts achieve ~4x speed-up vs. 1 core
GPU configuration:
— NVIDIA Tesla C2070
Test cases
Box Pipe
Simple geometry
Systems of the same size
Need to compute at all points of the rectangular grid
[Figure: rectangular channel of cross-section 1 × 1 and length L.]
Test cases
White Sea
Complex geometry
Large variation in system sizes
Need to compute only inside the sea area
Performance results – Box Pipe
Grid 128x128x128
[Charts: throughput (segments/ms) for Solve X, Solve Y, Solve Z, and Total, CPU vs. GPU, in single and double precision. GPU speed-up over the CPU: 9.3x (single), 8.4x (double).]
Performance results – White Sea
Grid 256x192x160
[Charts: throughput (segments/ms) for Solve X, Solve Y, Solve Z, and Total, CPU vs. GPU, in single and double precision. GPU speed-up over the CPU: 10.3x (single), 9.5x (double).]
Outline
Fluid Simulation in 3D domain
Multi-GPU implementation
— General splitting algorithm
— Running computations using CUDA
— Benchmarking and performance analysis
— Improving weak scaling
Multi-GPU motivation
Limited available amount of memory
— 3D arrays: grid, temporary arrays, matrices
— Max grid size that fits into a Tesla M2050 is ~224³
Distribute the computations between multiple GPUs and
multiple nodes
— Can compute large grids
— Speed-up computations
Main Idea of mGPU
Systems along Y/Z are solved independently in parallel on each GPU
— No data transfer
Along X, data must be synchronized
[Figure: the grid is split along X across GPU 0, GPU 1, GPU 2; the alternating directions X, Y, Z are computed in turn.]
CUDA - parallelization
Split the grid along X (the longest stride)
Z + Y * dimZ + X * dimY * dimZ
Launch kernels on several GPUs from one host thread
Data transfer
— Async P2P through PCI-E (cudaMemcpyPeerAsync)
for (int i = 0; i < numDev; i++)
{
    cudaSetDevice(i);                 // switch device
    kernel<<<…>>>(devArray[i], ..);   // computation; launches are asynchronous, so all GPUs run concurrently
}
CUDA 4.x
Synchronization of Nonlinear Layer
• High aggregate throughput on an 8-GPU system
• Communication impact is not significant
// send the right boundary layer of GPU i to the left halo of GPU i+1
for (int i = 0; i < numDev-1; i++)
    cudaMemcpyPeerAsync(dHaloLeft[i+1], i+1, dDataRight[i], i, num_bytes, devStream[i]);
// might need multi-device synchronization here
// send the left boundary layer of GPU i to the right halo of GPU i-1
for (int i = 1; i < numDev; i++)
    cudaMemcpyPeerAsync(dHaloRight[i-1], i-1, dDataLeft[i], i, num_bytes, devStream[i]);
Solve X (tridiagonal solver)
[Figure: X segments distributed across GPU 0, GPU 1, GPU 2; segments are classified as bound, partially bound, or unbound, with halo layers at GPU boundaries.]
Solve X (tridiagonal solver)
• Process bound segments without intercommunication
• Interleave segments for better memory access – one segment per thread
• Align to the left
• Gauss elimination
• Communicate during the forward and backward sweeps
Solve X: split the grid (“long X”)
• Array[i*dimz*dimy+…]
• Allocation of layers across GPUs
• 3D segment analysis
Forward sweep along X: the sweep moves slab by slab, so only one GPU is active at a time while the others wait.
Back sweep along X: the active GPU moves through the slabs in reverse order.
Result: no speedup along X — the Gauss elimination along the split direction is sequential.
Benchmarks
Multiple GPU: 8 Tesla M2050 with P2P
Multiple Nodes: 4 InfiniBand MPI nodes, 1 Tesla M2090 each
Sample tests: Box Pipe, White Sea
Results: 8 GPU, 1 MPI node
[Charts: total throughput (millions of points per second) vs. number of GPUs (1, 2, 4, 8) for White Sea and Box Pipe; Tesla M2050, grid 224³. Observed scaling factors: x4.5 / x1.4 (Box Pipe) and x2.9 / x1.35 (White Sea).]
1 GPU Efficiency
[Charts: single-GPU throughput (points/ms) vs. grid size for Box Pipe and White Sea, Tesla M2090.]
Estimate the amount of work per GPU in the 8-GPU system using a single GPU: 256³ / 8 = 128³ points per GPU.
— Box Pipe: a 128³ sub-grid gives enough work for a single GPU.
— White Sea: the computed region takes only about 5% of the grid volume, so a 128³ sub-grid is not enough.
Results: 1 GPU, 4 MPI nodes
[Charts: total throughput (millions of points per second) vs. number of MPI nodes (1, 2, 4), one Tesla M2090 per node, for White Sea and Box Pipe; observed scaling factors: x2.8 and x1.2.]
Load Balancing
[Chart: number of tridiagonal segments (Y(x) + Z(x) + X(x)dx) as a function of the X coordinate, White Sea grid.]
X splitting criteria:
— Equal volumes
— Equal number of segments (see the sketch below)
Performance benefit observed: up to 15.5% (Tesla M2090)
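A hypothetical host-side sketch of the “equal number of segments” criterion: given the segment count in every X slab, choose cut positions so that each GPU gets roughly the same total number of segments.
#include <vector>

std::vector<int> splitBySegments(const std::vector<long long>& segsPerSlab, int numDev)
{
    long long total = 0;
    for (long long s : segsPerSlab) total += s;

    std::vector<int> cuts;            // cuts[d-1] = first X slab owned by device d
    long long acc = 0;
    int nextDev = 1;
    for (int x = 0; x < (int)segsPerSlab.size() && nextDev < numDev; ++x) {
        acc += segsPerSlab[x];
        if (acc * numDev >= total * nextDev) {   // crossed the next 1/numDev quantile
            cuts.push_back(x + 1);
            ++nextDev;
        }
    }
    return cuts;
}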
Load Balancing. White Sea (288x320x320)
[Charts: per-GPU time (ms) for SweepX, SweepY, SweepZ, and Transpose on GPU 0–3 (Tesla M2090), for three splittings: Even X (t_total = 47.3 ms), Even Segments (t_total = 44.3 ms), Even Volumes (t_total = 44.4 ms).]
Analysis
All parts of the solver but one (Gauss elimination along X)
are fully parallel
Communication (using P2P + InfiniBand) is not a big issue for the given problem sizes
Bad weak scaling
— Use blocks to hide the latency of the X sweeps
Improved Solve X
[Figure: the grid on GPU0, GPU1, GPU2 is split into XY blocks B0 … B4 along the Z direction.]
Splitting the grid into XY blocks along the Z direction:
• Array[i*dimz*dimy+…], allocation of layers across GPUs, 3D segment analysis
• Segments sorting
• Sweep through all scalar fields at once
Block pipeline:
— Forward sweep along X, async halo send forward
— Move to the next block group
— Backward sweep along X, async halo send backward
Equal work per node!
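A rough host-side sketch of the block pipeline on one node (kernel and buffer names are hypothetical): while block b is swept on the compute stream, its halo is already being sent to the next GPU on a copy stream, so the neighbour can start its part of the sweep one block behind instead of waiting for the whole slab; the backward sweep follows the same pattern in reverse.
for (int b = 0; b < numBlocks; ++b) {
    forwardSweepX<<<grid, block, 0, computeStream>>>(coeffs[b]);   // sweep block b along X
    cudaEventRecord(blockDone[b], computeStream);
    cudaStreamWaitEvent(copyStream, blockDone[b], 0);              // copy waits only for block b
    cudaMemcpyPeerAsync(haloOnNext[b], nextDev, haloMine[b], myDev,
                        haloBytes, copyStream);                    // async halo send forward
}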
Algorithm
[Diagram: pipeline schedule over nodes node_0 … N_nodes and Z blocks block_0 … N_Z/N_blocks, with offsets i_block − i_node (forward) and 2(N_nodes − i_node − 1) (backward); each node receives halos X_{i_node−1} and X_{i_node+1} and sends X_{i_node}, using two CUDA streams (cudaStream1 for the forward sweep, cudaStream2 for the backward sweep).]
Improved Solve XY
[Figure: the same XY blocks B0 … B4 along Z, now used for the Y sweeps.]
Separate buffer for Y
sweeps
Block Y sweeps are
performed independently
in separate cudaStreams
Helps with data
transfer/compute overlap
Weak Scaling
[Chart: average time (ms) for Solve XYZ vs. number of GPUs (weak scaling), Box Pipe, grids 224³, 288³, 352³, 448³, Tesla M2050.]
Big Systems Limit
[Chart: average time (ms) for Solve XYZ vs. number of blocks (1–32), 8 Tesla M2050 GPUs, grid 768³.]
Consider one scalar field: no physics, more available RAM.
With larger grid sizes, the curve minimum shifts down and to the right.
Conclusions
The GPU outperforms a multi-core CPU by more than a 10x factor
GPU works well with complex input domains
Performance and scaling factors heavily depend on the input geometry and grid size
— Efficient work distribution methods are essential for performance
Using block splitting for ADI improves the scaling factor by hiding the sequential dependency of the sweep
Future work
Test on large scale systems
— Potentially on “Lomonosov” supercomputer at MSU
— GPU part with peak performance of 863 TFlops
Memory usage optimizations
Explore different tridiagonal approaches
Questions?
Thank You !