TRANSCRIPT
Introducing GPUs to a Commercial Reservoir Simulator
Dominic Walsh, Paul Woodhams, Schlumberger
Yongpeng Zhang, Ken Esler, Stone Ridge Technology Inc.
Reservoir Simulation
• Purpose: estimate reserves, predict optimal recovery and production strategy
• Input: rock, fluid, and well data; production history
• Model size: 10^4 cells (laptops) to 10^9 cells (Linux clusters)
• Uncertainty: multiple realizations
• Embedded: Network, Plant, Economics
Very computationally demanding
[Chart: client model size (millions of cells), 2011-2014]
History and Problem Size
• GPUs have been very successful in the seismic domain
• Seismic clusters are acquiring GPUs & InfiniBand and are simulation-ready
• Clients are being constrained by power envelopes
• A new GPU simulator?
  • ECLIPSE (circa 1984) is the industry standard
  • INTERSECT (circa 2010) is the "high fidelity" simulator
  • Testing and validation is measured in man-decades
  • User base migration is expected to take a 5-10 year timeframe
Can we take advantage of new GPU hardware while preserving this investment?
Structure
• Reservoir:
  • Deposition: semi-structured grids
  • Finite volume: low-order stencils
  • Static structure
  • Time stepping: implicit and adaptive (see the sketch below)
  • Up to billions of cells
• Wells:
  • Pipe flow
  • Introduce local structure
  • Up to 10^5 wells
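For context on where the linear solver sits, here is a minimal sketch of the fully implicit, adaptive time-stepping control flow: each step runs a Newton loop whose inner linear solve is the cost that dominates the later slides. The scalar test equation and all names are illustrative stand-ins, not the INTERSECT engine.

```cpp
#include <cstdio>
#include <cmath>

// Stand-ins for the expensive per-step phases; solve F(x,t) = x^2 - (2 + t) = 0.
double residual(double x, double t)    { return x * x - (2.0 + t); }
double jacobian(double x)              { return 2.0 * x; }
double linearSolve(double J, double r) { return r / J; }   // J * dx = r

int main() {
    double x = 1.0, t = 0.0, dt = 0.1;
    const double tEnd = 1.0;
    while (t < tEnd) {
        double tNew = t + dt;
        int newton = 0;
        double r = residual(x, tNew);
        while (std::fabs(r) > 1e-10 && newton < 20) {   // fully implicit Newton loop
            x -= linearSolve(jacobian(x), r);           // the hot spot in a real simulator
            r = residual(x, tNew);
            ++newton;
        }
        // Adaptive step control: cut the step if Newton struggled, grow it otherwise.
        dt = (newton > 10) ? 0.5 * dt : 1.2 * dt;
        t = tNew;
        std::printf("t=%.3f  newton=%d  x=%.6f\n", t, newton, x);
    }
    return 0;
}
```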
Reservoir Grid and Wells
Irregularity and Nonlinearity
• Many small tightly-coupled sub-problems
• Time varying structure
[Diagram: well structure with terminal (THP), internal constraint, inflow model, flow node; phase envelope]
• Complicated fluid and phase modelling
• Per-cell nonlinear systems
• Possibly non-reversible rock models
Well Structure
Phase I: Thermal Linear Solver
• Scope: thermal, single box, linear solver
• Drivers: code volume, small problem size, fully implicit, Windows workstation, Amdahl's law
[Chart: THERMAL 16X parallel runtime breakdown: FM 0.21%, Reporting 0.44%, Properties 4.02%, Matrix 8.22%, Linear Solver 71.55%, Other 10.19%]
[Chart: lines of code, solver vs. engine]
Test Model: THERM
• Small: 1M cells & 9 well pairs
• Thermal: CH4 + bitumen
• 2.5 years of steam injection
• Very strong transitions
• Numerically very demanding
[Figures: property distribution and solution distribution]
Linear Solver Big Picture
• Iterative solver: FGMRes
• Composite preconditioner: Constrained Pressure Residual (CPR) method (sketched below)
  – Block 4x4 full system and scalar (1x1) pressure system
  – Multigrid on the pressure-only system (GAMPACK?)
  – ILU on the full system
• Multi-color SpMV & orthogonalisation
(Parallel Forall Blog: HPCG, Everett Phillips)
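A minimal host-side sketch of the two-stage CPR apply described above, assuming pressure is unknown 0 of each 4x4 block and using single Jacobi sweeps as stand-ins for the AMG (pressure) and ILU (full-system) stages; the matrix layout, helper names, and toy usage are illustrative, not the INTERSECT or GAMPACK API.

```cpp
#include <cstdio>
#include <vector>

struct CSR {                                    // simple scalar CSR matrix
    std::vector<int> rowPtr, col;
    std::vector<double> val;
    int n = 0;                                  // number of rows
};

// y = A * x
void spmv(const CSR& A, const std::vector<double>& x, std::vector<double>& y) {
    for (int i = 0; i < A.n; ++i) {
        double s = 0.0;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            s += A.val[k] * x[A.col[k]];
        y[i] = s;
    }
}

// Stand-in approximate solve: one Jacobi sweep with the diagonal of A.
void approxSolve(const CSR& A, const std::vector<double>& r, std::vector<double>& x) {
    for (int i = 0; i < A.n; ++i) {
        double d = 1.0;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k)
            if (A.col[k] == i) d = A.val[k];
        x[i] = r[i] / d;
    }
}

// z ~= M^{-1} r for the two-stage CPR preconditioner.
// A  : full block system stored as scalar CSR (blockSize * nCells rows)
// Ap : scalar pressure system (nCells rows); pressure is unknown 0 of each block
void cprApply(const CSR& A, const CSR& Ap, int blockSize,
              const std::vector<double>& r, std::vector<double>& z) {
    const int nCells = Ap.n;
    // Stage 1: restrict the residual to pressure, "AMG" solve, prolong back.
    std::vector<double> rp(nCells), xp(nCells);
    for (int c = 0; c < nCells; ++c) rp[c] = r[c * blockSize];
    approxSolve(Ap, rp, xp);                     // GAMPACK-style AMG in the real solver
    std::vector<double> x1(A.n, 0.0);
    for (int c = 0; c < nCells; ++c) x1[c * blockSize] = xp[c];
    // Stage 2: correct the residual and smooth the full system.
    std::vector<double> Ax1(A.n), r2(A.n), x2(A.n);
    spmv(A, x1, Ax1);
    for (int i = 0; i < A.n; ++i) r2[i] = r[i] - Ax1[i];
    approxSolve(A, r2, x2);                      // multi-colour ILU in the real solver
    for (int i = 0; i < A.n; ++i) z[i] = x1[i] + x2[i];
}

// Trivial usage: diagonal stand-in systems, 3 cells with 4 unknowns each.
CSR diag(int n, double v) {
    CSR A; A.n = n; A.rowPtr.assign(1, 0);
    for (int i = 0; i < n; ++i) {
        A.col.push_back(i); A.val.push_back(v);
        A.rowPtr.push_back(i + 1);
    }
    return A;
}

int main() {
    const int nCells = 3, blockSize = 4;
    CSR A = diag(nCells * blockSize, 2.0), Ap = diag(nCells, 2.0);
    std::vector<double> r(A.n, 1.0), z(A.n, 0.0);
    cprApply(A, Ap, blockSize, r, z);
    std::printf("z[0]=%.3f z[1]=%.3f\n", z[0], z[1]);   // 0.500 0.500 for this toy case
    return 0;
}
```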
Preliminary indicators
[Charts: sparse matrix multiply time for IX 16X, MKL 16X, HYBRID, and K20X (good: ~5X); number of linear iterations for CPU, GPU AMG, and GPU ILU, showing algorithmic weakening (the challenge†: ~0.75X-0.9X)]
† Fine-Grained Parallel Preconditioners for Fast GPU-based Solvers, Dimitar Lukarski, GTC 2012; High Performance Algebraic Multigrid for Commercial Applications, Jonathan Cohen, GTC 2013
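To illustrate the fine-grained parallelism behind the GPU preconditioners cited above (and why they can weaken the algorithm relative to a sequential sweep), here is a minimal host-side sketch of multi-colouring: rows of one colour share no coupling, so they can be updated in parallel, at the cost of more iterations. The greedy colouring, the Gauss-Seidel sweep standing in for ILU, and all names are illustrative assumptions, not the solvers referenced in the footnote.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

struct CSR { std::vector<int> rowPtr, col; std::vector<double> val; int n = 0; };

// Greedy graph colouring of the matrix adjacency.
std::vector<int> colorRows(const CSR& A) {
    std::vector<int> color(A.n, -1);
    for (int i = 0; i < A.n; ++i) {
        std::vector<bool> used(A.n, false);
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            int j = A.col[k];
            if (j != i && color[j] >= 0) used[color[j]] = true;
        }
        int c = 0;
        while (used[c]) ++c;
        color[i] = c;
    }
    return color;
}

// One Gauss-Seidel sweep processed colour by colour; all rows of one colour
// are independent, so on a GPU each colour becomes one parallel kernel launch.
void coloredSweep(const CSR& A, const std::vector<int>& color, int nColors,
                  const std::vector<double>& b, std::vector<double>& x) {
    for (int c = 0; c < nColors; ++c)
        for (int i = 0; i < A.n; ++i) {          // parallel over rows of colour c
            if (color[i] != c) continue;
            double s = b[i], d = 1.0;
            for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
                if (A.col[k] == i) d = A.val[k];
                else s -= A.val[k] * x[A.col[k]];
            }
            x[i] = s / d;
        }
}

int main() {
    // 1-D Laplacian: tridiagonal [-1, 2, -1], a classic 2-colourable case.
    const int n = 6;
    CSR A; A.n = n; A.rowPtr.push_back(0);
    for (int i = 0; i < n; ++i) {
        if (i > 0)     { A.col.push_back(i - 1); A.val.push_back(-1.0); }
        A.col.push_back(i); A.val.push_back(2.0);
        if (i < n - 1) { A.col.push_back(i + 1); A.val.push_back(-1.0); }
        A.rowPtr.push_back(static_cast<int>(A.col.size()));
    }
    std::vector<int> color = colorRows(A);
    int nColors = 0;
    for (int c : color) nColors = std::max(nColors, c + 1);
    std::vector<double> b(n, 1.0), x(n, 0.0);
    for (int sweep = 0; sweep < 50; ++sweep) coloredSweep(A, color, nColors, b, x);
    std::printf("colours=%d  x[0]=%.3f\n", nColors, x[0]);
    return 0;
}
```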
Offload: MPI & Multiplexing
• INTERSECT: MPI process per domain
• Device shared memory? Only on Linux, not Windows
• Use threads to drive multiple cards (see the sketch after the diagram)
  – C++ threads, NOT OpenMP
  – CUDA 7
• Transfer: stage on the host side, pinned memory
[Diagram: 16 IX processes (Proc 0 … Proc 7, …) hand their data to one driver thread per GPU via SHMEM, feeding GPU 0 and GPU 1 while the remaining threads and processes sleep]
[Timeline: standard parallel IX → multithreaded GPU solver → standard parallel IX]
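A minimal sketch of the one-thread-per-GPU multiplexing idea: each host thread binds to a device with cudaSetDevice, stages its data in pinned host memory, and drives its own stream. Buffer and function names are illustrative; this is not the INTERSECT offload code, and the SHMEM gathering of the 16 IX processes' data is not shown.

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <thread>
#include <vector>

// Each driver thread owns one GPU: bind to the device, stage the domain data
// in pinned host memory, then copy and solve asynchronously on a stream.
void driveDevice(int dev, const double* hostMatrix, size_t nBytes) {
    cudaSetDevice(dev);                                   // bind this thread to one GPU
    double *hStage = nullptr, *dMatrix = nullptr;
    cudaMallocHost(&hStage, nBytes);                      // pinned staging buffer
    cudaMalloc(&dMatrix, nBytes);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    std::memcpy(hStage, hostMatrix, nBytes);              // stage on the host side
    cudaMemcpyAsync(dMatrix, hStage, nBytes, cudaMemcpyHostToDevice, stream);
    // ... solver kernels for this domain would be launched on 'stream' here ...
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(dMatrix);
    cudaFreeHost(hStage);
}

int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    std::vector<double> matrix(1 << 20, 1.0);             // stand-in for an assembled Jacobian
    std::vector<std::thread> workers;
    for (int d = 0; d < nDev; ++d)                        // one thread per GPU
        workers.emplace_back(driveDevice, d, matrix.data(),
                             matrix.size() * sizeof(double));
    for (auto& t : workers) t.join();
    std::printf("drove %d device(s)\n", nDev);
    return 0;
}
```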
Transfer Cost
• Transfer cost is a significant fraction of the complete CPU solve time
• A naïve implementation is not sufficient
[Timeline: CPU-only run is Props → Jacobian → Setup → Solve; in the GPU run the matrix crosses the PCI bus after the Jacobian, Setup and Solve run on the device, and the (small) solution is transferred back]
Overlapping & CPR
• CPR is a composite preconditioner!
  – Pressure system is 1/16th the size
  – AMG: small but costly
  – Second stage is relatively cheap
• Use streams (see the sketch after the timeline)
  – Per matrix
  – Per thread/GPU
• Lambdas in CUDA 7
• Use mixed precision
[Timeline: overlapped version of the Props → Jacobian → Setup → Solve pipeline; the pressure system is 1/16th the size and only the small solution returns over the bus]
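A minimal sketch of the overlap described above, assuming the pressure system is kept in single precision and is about 1/16th the size of the full 4x4 system: the small pressure copy and its setup run on one stream while the full-system copy is still crossing the PCI bus on another. Names, sizes, and the empty setup kernels are illustrative stand-ins; the CUDA 7 lambdas the talk mentions are not shown here.

```cpp
#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Setup stand-ins; the real solver runs AMG setup on the pressure system
// and ILU setup on the full system.
__global__ void setupPressure(const float* Ap, int n) { (void)Ap; (void)n; }
__global__ void setupFull(const double* A, int n)     { (void)A;  (void)n; }

int main() {
    const int nFull = 1 << 24;                  // full 4x4 system entries (stand-in)
    const int nPres = nFull / 16;               // pressure system is ~1/16th the size

    std::vector<double> hFull(nFull, 1.0);
    std::vector<float>  hPres(nPres, 1.0f);     // mixed precision: pressure kept in float

    double *dFull = nullptr; float *dPres = nullptr;
    cudaMalloc(&dFull, nFull * sizeof(double));
    cudaMalloc(&dPres, nPres * sizeof(float));

    // Pinned host staging so the copies can run asynchronously.
    double *hFullPinned = nullptr; float *hPresPinned = nullptr;
    cudaMallocHost(&hFullPinned, nFull * sizeof(double));
    cudaMallocHost(&hPresPinned, nPres * sizeof(float));
    std::copy(hFull.begin(), hFull.end(), hFullPinned);
    std::copy(hPres.begin(), hPres.end(), hPresPinned);

    cudaStream_t sPres, sFull;
    cudaStreamCreate(&sPres);
    cudaStreamCreate(&sFull);

    // The small pressure matrix goes first: its setup can start while the
    // large full-system copy is still crossing the PCI bus on the other stream.
    cudaMemcpyAsync(dPres, hPresPinned, nPres * sizeof(float),
                    cudaMemcpyHostToDevice, sPres);
    setupPressure<<<256, 256, 0, sPres>>>(dPres, nPres);

    cudaMemcpyAsync(dFull, hFullPinned, nFull * sizeof(double),
                    cudaMemcpyHostToDevice, sFull);
    setupFull<<<256, 256, 0, sFull>>>(dFull, nFull);

    cudaDeviceSynchronize();
    cudaStreamDestroy(sPres); cudaStreamDestroy(sFull);
    cudaFree(dFull); cudaFree(dPres);
    cudaFreeHost(hFullPinned); cudaFreeHost(hPresPinned);
    return 0;
}
```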
THERM Results
• Good solver speedup
• Still carrying a lot of non-solver time
• Marginal overall benefit
[Charts: elapsed time in seconds (linear solver solve, linear solver setup, not linear solver) and linear solver speedup for 16X CPU, +K40, +2xK40, +4xK20X]
Larger Model: THERM_L (4M)
• Better solver speedup
• More work on the cards
[Charts: linear solver speedup and elapsed time in seconds (not linear solver, linear solver setup, linear solver solve) for 16X CPU, +4xM2090, +4xK40; bigger impact than on THERM]
Implications: GPU vs CPU
• Currently need 48 CPU nodes to match the GPU performance
• Increasing the CPU count does not speed things up (Amdahl)
[Chart: THERM_L CPU elapsed time (hr.) for 16, 32, 48 CPU vs. 16+4xM2090 and 16+4xK40]
THERM_L Strong Scaling
[Chart: THERM_L linear solver speedup for 16X CPU, 4xM2090, 4xK40 (CUDA solver V1, single node) and 16 M2090, 4 K40, 16 K40, 2 K80, 4 K80, 8 K80, 16 K80 (CUDA solver V2, MPI cluster-ready, multi-node)]
26X Linear Solver speedup
THERM_XL (16M): Strong Scaling
[Chart: CPU vs. GPU linear solver time for 60MPI, 120MPI_II, 240MPI_II, 320MPI, 480MPI, 54MPI+18K40, 108MPI+36K40, 180+60K40, 216MPI+72K40]
[Chart: compute density, linear solver time (hrs) vs. number of nodes, comparing CPU best time, GPU worst time, and GPU best time; annotated 2.2X, 5.4X, and 4.7X with more nodes]
Next Steps
• Commercialize the current solution
  • Lessons learnt for the CPU solver
  • Cluster hardware implications?
• The linear solver is not enough: extend GPU coverage
  • Wells: too small & too complicated, remain on the CPU
  • Reservoir: Jacobian construction, property calculation
• Requirements:
  • Single code base: OpenACC? Custom?
  • Overlapping rework
Thanks and Acknowledgements
Many thanks to Jonathan Cohen, Julian Demouth, Patrice Castonguay, Justin Luitjens, Ken Hester & Doug Holt
Questions?