TRANSCRIPT
Introducing GPUs to a Commercial Reservoir Simulator
Dominic Walsh, Paul Woodhams, Schlumberger
Yongpeng Zhang, Ken Esler, Stone Ridge Technology Inc.
Reservoir Simulation
• Purpose: estimate reserves, predict optimal recovery and production strategy
• Input: rock, fluid, and well data; production history
• Model size: 10^4 cells (laptops) to 10^9 cells (Linux clusters)
• Uncertainty: multiple realizations
• Embedded: Network, Plant, Economics
Very computationally demanding
[Chart: client model size (millions of cells), 2011-2014]
History and Problem Size
• GPUs have been very successful in the seismic domain
• Seismic clusters are acquiring GPUs & InfiniBand and are simulation-ready
• Clients are being constrained by power envelopes
• A new GPU simulator?
  • ECLIPSE (circa 1984) is the industry standard
  • INTERSECT (circa 2010) is the "high fidelity" simulator
  • Testing and validation is measured in man-decades
  • User base migration is expected to take a 5-10 year timeframe
Can we take advantage of new GPU hardware while preserving this investment?
Structure
• Reservoir:
  • Deposition: semi-structured grids
  • Finite volume: low-order stencils
  • Static structure
  • Time stepping: implicit and adaptive (see the sketch below)
  • Up to billions of cells
• Wells:
  • Pipe flow
  • Introduce local structure
  • Up to 10^5 wells
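For context on where the linear solver sits, here is a minimal sketch of the fully implicit, adaptive time-stepping control flow: each step runs a Newton loop whose inner linear solve is the cost that dominates the later slides. The scalar test equation and all names are illustrative stand-ins, not the INTERSECT engine.

```cpp
#include <cstdio>
#include <cmath>

// Stand-ins for the expensive per-step phases; solve F(x,t) = x^2 - (2 + t) = 0.
double residual(double x, double t)    { return x * x - (2.0 + t); }
double jacobian(double x)              { return 2.0 * x; }
double linearSolve(double J, double r) { return r / J; }   // J * dx = r

int main() {
    double x = 1.0, t = 0.0, dt = 0.1;
    const double tEnd = 1.0;
    while (t < tEnd) {
        double tNew = t + dt;
        int newton = 0;
        double r = residual(x, tNew);
        while (std::fabs(r) > 1e-10 && newton < 20) {   // fully implicit Newton loop
            x -= linearSolve(jacobian(x), r);           // the hot spot in a real simulator
            r = residual(x, tNew);
            ++newton;
        }
        // Adaptive step control: cut the step if Newton struggled, grow it otherwise.
        dt = (newton > 10) ? 0.5 * dt : 1.2 * dt;
        t = tNew;
        std::printf("t=%.3f  newton=%d  x=%.6f\n", t, newton, x);
    }
    return 0;
}
```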
Reservoir Grid and Wells
Irregularity and Nonlinearity
• Many small tightly-coupled sub-problems
• Time varying structure
[Diagram: well structure with terminal (THP), internal constraint, inflow model, flow node; phase envelope]
• Complicated fluid and phase modelling
• Per-cell nonlinear systems
• Possibly non-reversible rock models
Well Structure
Phase I: Thermal Linear Solver
• Scope: thermal, single box, linear solver
• Drivers: code volume, small problem size, fully implicit, Windows workstation, Amdahl's law
[Chart: THERMAL 16X parallel runtime breakdown: FM 0.21%, Reporting 0.44%, Properties 4.02%, Matrix 8.22%, Linear Solver 71.55%, Other 10.19%]
[Chart: lines of code, solver vs. engine]
Test Model: THERM
• Small: 1M cells & 9 well pairs
• Thermal: CH4 + bitumen
• 2.5 years of steam injection
• Very strong transitions
• Numerically very demanding
[Figures: property distribution and solution distribution]
Linear Solver Big Picture
• Iterative solver: FGMRes
• Composite preconditioner: Constrained Pressure Residual (CPR) method (sketched below)
  – Block 4x4 full system and scalar (1x1) pressure system
  – Multigrid on the pressure-only system (GAMPACK?)
  – ILU on the full system
• Multi-color SpMV & orthogonalisation
(Parallel Forall Blog: HPCG, Everett Phillips)
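A minimal host-side sketch of the two-stage CPR apply described above, assuming pressure is unknown 0 of each 4x4 block and using single Jacobi sweeps as stand-ins for the AMG (pressure) and ILU (full-system) stages; the matrix layout, helper names, and toy usage are illustrative, not the INTERSECT or GAMPACK API.

```cpp
#include <cstdio>
#include <vector>

struct CSR {                                    // simple scalar CSR matrix
    std::vector<int> rowPtr, col;
    std::vector<double> val;
    int n = 0;                                  // number of rows
};

// y = A * x
void spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}

// Stand-in approximate solve: one Jacobi sweep with the diagonal of A.
void approxSolve(const CSR& A, const std::vector<double>& r, std::vector<double>& x) {
    for (int i = 0; i < A.n; ++i) {
        double d = 1.0;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            if (A.col[k] == i) d = A.val[k];
        x[i] = r[i] / d;
    }
}

// z ~= M^{-1} r for the two-stage CPR preconditioner.
// A  : full block system stored as scalar CSR (blockSize * nCells rows)
// Ap : scalar pressure system (nCells rows); pressure is unknown 0 of each block
void cprApply(const CSR& A, const CSR& Ap, int blockSize,
              const std::vector<double>& r, std::vector<double>& z) {
    const int nCells = Ap.n;
    // Stage 1: restrict the residual to pressure, "AMG" solve, prolong back.
    std::vector<double> rp(nCells), xp(nCells);
    for (int c = 0; c < nCells; ++c) rp[c] = r[c * blockSize];
    approxSolve(Ap, rp, xp);                     // GAMPACK-style AMG in the real solver
    std::vector<double> x1(A.n, 0.0);
    for (int c = 0; c < nCells; ++c) x1[c * blockSize] = xp[c];
    // Stage 2: correct the residual and smooth the full system.
    std::vector<double> Ax1(A.n), r2(A.n), x2(A.n);
    spmv(A, x1, Ax1);
    for (int i = 0; i < A.n; ++i) r2[i] = r[i] - Ax1[i];
    approxSolve(A, r2, x2);                      // multi-colour ILU in the real solver
    for (int i = 0; i < A.n; ++i) z[i] = x1[i] + x2[i];
}

// Trivial usage: diagonal stand-in systems, 3 cells with 4 unknowns each.
CSR diag(int n, double v) {
    CSR A; A.n = n; A.rowPtr.assign(1, 0);
    for (int i = 0; i < n; ++i) {
        A.col.push_back(i); A.val.push_back(v);
        A.rowPtr.push_back(i + 1);
    }
    return A;
}

int main() {
    const int nCells = 3, blockSize = 4;
    CSR A = diag(nCells * blockSize, 2.0), Ap = diag(nCells, 2.0);
    std::vector<double> r(A.n, 1.0), z(A.n, 0.0);
    cprApply(A, Ap, blockSize, r, z);
    std::printf("z[0]=%.3f z[1]=%.3f\n", z[0], z[1]);   // 0.500 0.500 for this toy case
    return 0;
}
```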
Preliminary indicators
[Charts: sparse matrix multiply time for IX 16X, MKL 16X, HYBRID, and K20X (good: ~5X); number of linear iterations for CPU, GPU AMG, and GPU ILU, showing algorithmic weakening (the challenge†: ~0.75X-0.9X)]
† Fine-Grained Parallel Preconditioners for Fast GPU-based Solvers, Dimitar Lukarski, GTC 2012; High Performance Algebraic Multigrid for Commercial Applications, Jonathan Cohen, GTC 2013
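To illustrate the fine-grained parallelism behind the GPU preconditioners cited above (and why they can weaken the algorithm relative to a sequential sweep), here is a minimal host-side sketch of multi-colouring: rows of one colour share no coupling, so they can be updated in parallel, at the cost of more iterations. The greedy colouring, the Gauss-Seidel sweep standing in for ILU, and all names are illustrative assumptions, not the solvers referenced in the footnote.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct CSR { std::vector<int> rowPtr, col; std::vector<double> val; int n = 0; };

// Greedy graph colouring of the matrix adjacency.
std::vector<int> colorRows(const CSR& A) {
    std::vector<int> color(A.n, -1);
    for (int i = 0; i < A.n; ++i) {
        std::vector<bool> used(A.n, false);
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            int j = A.col[k];
            if (j != i && color[j] >= 0) used[color[j]] = true;
        }
        int c = 0;
        while (used[c]) ++c;
        color[i] = c;
    }
    return color;
}

// One Gauss-Seidel sweep processed colour by colour; all rows of one colour
// are independent, so on a GPU each colour becomes one parallel kernel launch.
void coloredSweep(const CSR& A, const std::vector<int>& color, int nColors,
                  const std::vector<double>& b, std::vector<double>& x) {
    for (int c = 0; c < nColors; ++c)
        for (int i = 0; i < A.n; ++i) {          // parallel over rows of colour c
            if (color[i] != c) continue;
            double s = b[i], d = 1.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                if (A.col[k] == i) d = A.val[k];
                else s -= A.val[k] * x[A.col[k]];
            }
            x[i] = s / d;
        }
}

int main() {
    // 1-D Laplacian: tridiagonal [-1, 2, -1], a classic 2-colourable case.
    const int n = 6;
    CSR A; A.n = n; A.rowPtr.push_back(0);
    for (int i = 0; i < n; ++i) {
        if (i > 0)     { A.col.push_back(i - 1); A.val.push_back(-1.0); }
        A.col.push_back(i); A.val.push_back(2.0);
        if (i < n - 1) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
        A.rowPtr.push_back(static_cast<int>(A.col.size()));
    }
    std::vector<int> color = colorRows(A);
    int nColors = 0;
    for (int c : color) nColors = std::max(nColors, c + 1);
    std::vector<double> b(n, 1.0), x(n, 0.0);
    for (int sweep = 0; sweep < 50; ++sweep) coloredSweep(A, color, nColors, b, x);
    std::printf("colours=%d  x[0]=%.3f\n", nColors, x[0]);
    return 0;
}
```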
Offload: MPI & Multiplexing
• INTERSECT: MPI process per domain
• Device shared memory? Only on Linux, not Windows
• Use threads to drive multiple cards (see the sketch after the diagram)
  – C++ threads, NOT OpenMP
  – CUDA 7
• Transfer: stage on the host side, pinned memory
[Diagram: 16 IX processes (Proc 0 … Proc 7, …) hand their data to one driver thread per GPU via SHMEM, feeding GPU 0 and GPU 1 while the remaining threads and processes sleep]
[Timeline: standard parallel IX → multithreaded GPU solver → standard parallel IX]
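A minimal sketch of the one-thread-per-GPU multiplexing idea: each host thread binds to a device with cudaSetDevice, stages its data in pinned host memory, and drives its own stream. Buffer and function names are illustrative; this is not the INTERSECT offload code, and the SHMEM gathering of the 16 IX processes' data is not shown.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

// Each driver thread owns one GPU: bind to the device, stage the domain data
// in pinned host memory, then copy and solve asynchronously on a stream.
void driveDevice(int dev, const double* hostMatrix, size_t nBytes) {
    cudaSetDevice(dev);                                   // bind this thread to one GPU
    double *hStage = nullptr, *dMatrix = nullptr;
    cudaMallocHost(&hStage, nBytes);                      // pinned staging buffer
    cudaMalloc(&dMatrix, nBytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    std::memcpy(hStage, hostMatrix, nBytes);              // stage on the host side
    cudaMemcpyAsync(dMatrix, hStage, nBytes, cudaMemcpyHostToDevice, stream);
    // ... solver kernels for this domain would be launched on 'stream' here ...
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dMatrix);
    cudaFreeHost(hStage);
}

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    std::vector<double> matrix(1 << 20, 1.0);             // stand-in for an assembled Jacobian
    std::vector<std::thread> workers;
    for (int d = 0; d < nDev; ++d)                        // one thread per GPU
        workers.emplace_back(driveDevice, d, matrix.data(),
                             matrix.size() * sizeof(double));
    for (auto& t : workers) t.join();
    std::printf("drove %d device(s)\n", nDev);
    return 0;
}
```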
Transfer Cost
• Transfer cost is a significant fraction of the complete CPU solve time
• A naïve implementation is not sufficient
[Timeline: CPU-only run is Props → Jacobian → Setup → Solve; in the GPU run the matrix crosses the PCI bus after the Jacobian, Setup and Solve run on the device, and the (small) solution is transferred back]
Overlapping & CPR
• CPR is a composite preconditioner!
  – Pressure system is 1/16th the size
  – AMG: small but costly
  – Second stage is relatively cheap
• Use streams (see the sketch after the timeline)
  – Per matrix
  – Per thread/GPU
• Lambdas in CUDA 7
• Use mixed precision
[Timeline: overlapped version of the Props → Jacobian → Setup → Solve pipeline; the pressure system is 1/16th the size and only the small solution returns over the bus]
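A minimal sketch of the overlap described above, assuming the pressure system is kept in single precision and is about 1/16th the size of the full 4x4 system: the small pressure copy and its setup run on one stream while the full-system copy is still crossing the PCI bus on another. Names, sizes, and the empty setup kernels are illustrative stand-ins; the CUDA 7 lambdas the talk mentions are not shown here.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Setup stand-ins; the real solver runs AMG setup on the pressure system
// and ILU setup on the full system.
__global__ void setupPressure(const float* Ap, int n) { (void)Ap; (void)n; }
__global__ void setupFull(const double* A, int n)     { (void)A;  (void)n; }

int main() {
    const int nFull = 1 << 24;                  // full 4x4 system entries (stand-in)
    const int nPres = nFull / 16;               // pressure system is ~1/16th the size

    std::vector<double> hFull(nFull, 1.0);
    std::vector<float>  hPres(nPres, 1.0f);     // mixed precision: pressure kept in float

    double *dFull = nullptr; float *dPres = nullptr;
    cudaMalloc(&dFull, nFull * sizeof(double));
    cudaMalloc(&dPres, nPres * sizeof(float));

    // Pinned host staging so the copies can run asynchronously.
    double *hFullPinned = nullptr; float *hPresPinned = nullptr;
    cudaMallocHost(&hFullPinned, nFull * sizeof(double));
    cudaMallocHost(&hPresPinned, nPres * sizeof(float));
    std::copy(hFull.begin(), hFull.end(), hFullPinned);
    std::copy(hPres.begin(), hPres.end(), hPresPinned);

    cudaStream_t sPres, sFull;
    cudaStreamCreate(&sPres);
    cudaStreamCreate(&sFull);

    // The small pressure matrix goes first: its setup can start while the
    // large full-system copy is still crossing the PCI bus on the other stream.
    cudaMemcpyAsync(dPres, hPresPinned, nPres * sizeof(float),
                    cudaMemcpyHostToDevice, sPres);
    setupPressure<<<256, 256, 0, sPres>>>(dPres, nPres);

    cudaMemcpyAsync(dFull, hFullPinned, nFull * sizeof(double),
                    cudaMemcpyHostToDevice, sFull);
    setupFull<<<256, 256, 0, sFull>>>(dFull, nFull);

    cudaDeviceSynchronize();
    cudaStreamDestroy(sPres); cudaStreamDestroy(sFull);
    cudaFree(dFull); cudaFree(dPres);
    cudaFreeHost(hFullPinned); cudaFreeHost(hPresPinned);
    return 0;
}
```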
THERM Results
• Good solver speedup
• Still carrying a lot of non-solver time
• Marginal overall benefit
[Charts: elapsed time in seconds (linear solver solve, linear solver setup, not linear solver) and linear solver speedup for 16X CPU, +K40, +2xK40, +4xK20X]
Larger Model: THERM_L (4M)
• Better solver speedup
• More work on the cards
[Charts: linear solver speedup and elapsed time in seconds (not linear solver, linear solver setup, linear solver solve) for 16X CPU, +4xM2090, +4xK40; bigger impact than on THERM]
Implications: GPU vs CPU
• Currently need 48 CPU nodes to match the GPU performance
• Increasing the CPU count does not speed things up (Amdahl)
[Chart: THERM_L CPU elapsed time (hr.) for 16, 32, 48 CPU vs. 16+4xM2090 and 16+4xK40]
THERM_L Strong Scaling
[Chart: THERM_L linear solver speedup for 16X CPU, 4xM2090, 4xK40 (CUDA solver V1, single node) and 16 M2090, 4 K40, 16 K40, 2 K80, 4 K80, 8 K80, 16 K80 (CUDA solver V2, MPI cluster-ready, multi-node)]
26X Linear Solver speedup
THERM_XL (16M): Strong Scaling
[Chart: CPU vs. GPU linear solver time for 60MPI, 120MPI_II, 240MPI_II, 320MPI, 480MPI, 54MPI+18K40, 108MPI+36K40, 180+60K40, 216MPI+72K40]
[Chart: compute density, linear solver time (hrs) vs. number of nodes, comparing CPU best time, GPU worst time, and GPU best time; annotated 2.2X, 5.4X, and 4.7X with more nodes]
Next Steps
• Commercialize the current solution
  • Lessons learnt for the CPU solver
  • Cluster hardware implications?
• The linear solver is not enough: extend GPU coverage
  • Wells: too small & too complicated, remain on the CPU
  • Reservoir: Jacobian construction, property calculation
• Requirements:
  • Single code base: OpenACC? Custom?
  • Overlapping rework
Thanks and Acknowledgements
Many thanks to Jonathan Cohen, Julian Demouth, Patrice Castonguay, Justin Luitjens, Ken Hester & Doug Holt
Questions?