DiamondTile Algorithm for High-Performance Wave Modeling

Vadim Levchenko, Anastasia Perepelkina

Keldysh Institute of Applied Mathematics RAS

GTC 2015

FLOPs and Bandwidth Performance Ratio

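[Figure: log-log plot relating memory bandwidth, GB/s (ticks 10 to 1000), and peak performance, TFLOP/s (fp32) (ticks 0.1 to 10). Platforms shown: nVidia Maxwell, 2014-15; nVidia Kepler, 2012-13; Intel CPU, 2014; NEC SX, 199x. Guide lines: 4 Bytes/Flops, 0.1 Bytes/Flops, 0.04 Bytes/Flops.]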

RoofLine model:
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.

L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era? CSE13.

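Wave Modeling Specifics

$$\frac{\partial^2 F}{\partial t^2} = c^2\left(\frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2}\right) \quad (+\,\text{BCs} + \text{ICs})$$

Finite difference along each axis:

$$\left.\frac{\partial^2 F}{\partial x^2}\right|_{x_0,y_0,z_0} = \frac{1}{\Delta x^2}\sum_{i=0}^{N_O/2} C_i\left(F|_{x_0+i\Delta x,\,y_0,z_0} + F|_{x_0-i\Delta x,\,y_0,z_0}\right)$$

$N_O = 2$ for $\partial^2 F/\partial t^2$; $N_O = 2, 4, 6, \ldots, 14$ for coordinate axes.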
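As a minimal sketch of the symmetric finite-difference sum, the Python snippet below evaluates the second derivative of sin(x) with NO = 2 and NO = 4. The coefficient tables use the standard central-difference weights rewritten in the slide's convention, where the i = 0 term counts the central value twice and C_0 is therefore half the usual central weight. The names `COEFFS` and `second_derivative` are illustrative, not taken from the authors' code.

```python
import math

# Half-stencil coefficients C_i for
#   d2F/dx2 ~ (1/dx^2) * sum_{i=0}^{NO/2} C_i * (F(x0 + i*dx) + F(x0 - i*dx))
# Only NO = 2 and NO = 4 are tabulated in this sketch.
COEFFS = {
    2: [-1.0, 1.0],
    4: [-5.0 / 4.0, 4.0 / 3.0, -1.0 / 12.0],
}

def second_derivative(f, x0, dx, order):
    """Apply the symmetric stencil of the given approximation order."""
    c = COEFFS[order]
    total = sum(ci * (f(x0 + i * dx) + f(x0 - i * dx)) for i, ci in enumerate(c))
    return total / dx**2

x0, dx = 0.7, 0.01
exact = -math.sin(x0)  # analytic second derivative of sin
err2 = abs(second_derivative(math.sin, x0, dx, 2) - exact)
err4 = abs(second_derivative(math.sin, x0, dx, 4) - exact)
```

The higher-order stencil reads more neighbours per cell but gives a smaller error for the same grid spacing, which is why wide stencils matter for the data-traffic estimates.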

Per one cell, per one time step calculation:
- $O = 1 + 3N_O$ FMA operations
- $D = 3 + 3N_O$ data

Operational intensity: $O/D \sim 1/2$ Flop/byte (naïve algorithm).
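The 1/2 Flop/byte estimate can be checked with a short calculation. The sketch below assumes that one FMA counts as 2 floating-point operations and that each data item is a 4-byte fp32 value. Neither convention is stated on the slide, so both are exposed as parameters.

```python
def naive_counts(order):
    """Per-cell, per-time-step counts quoted on the slide."""
    fma_ops = 1 + 3 * order      # O = 1 + 3*NO FMA operations
    data_items = 3 + 3 * order   # D = 3 + 3*NO data items
    return fma_ops, data_items

def flop_per_byte(fma_ops, data_items, flops_per_fma=2, bytes_per_item=4):
    # Assumptions, not from the slide: FMA = 2 flops, fp32 = 4 bytes.
    return fma_ops * flops_per_fma / (data_items * bytes_per_item)

intensities = {no: flop_per_byte(*naive_counts(no)) for no in (2, 4, 6, 14)}
```

Under these assumptions the ratio stays a little below 1/2 for every order in the range and approaches 1/2 as the order grows.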

[Figure: space-time diagram with axes x, y, t showing the cone $x^2 + y^2 + z^2 = c^2 t^2$. Labels: domain of influence, domain of dependence, asynchronous domain (twice), synchronization instant.]

Cross-shaped stencil fits into diamond shape.

Wave equation modelling

Computational Grid projection to (x–t)

Traditional stepwise evaluation order

Overlapping stencils increase operational intensity:

- $O = 1 + 3N_O$ FMA operations
- $D = 3$ data

Operational intensity: $O/D \sim (1 + N_O)$ Flop/byte
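One way to read the (1 + NO) factor, offered as an interpretation rather than something the slide states: the operation count is unchanged while the data count drops from 3 + 3NO to 3, so the intensity rises by exactly (3 + 3NO)/3 = 1 + NO relative to the naive count. The sketch below checks this with exact rational arithmetic. Because it compares two ratios, the result does not depend on how flops and bytes are counted.

```python
from fractions import Fraction

def intensity_gain(order):
    """Ratio of overlapped-stencil intensity to naive intensity."""
    ops = 1 + 3 * order                      # same FMA count in both cases
    naive = Fraction(ops, 3 + 3 * order)     # D = 3 + 3*NO data items
    overlapped = Fraction(ops, 3)            # D = 3 data items
    return overlapped / naive

gains = {no: intensity_gain(no) for no in (2, 4, 6, 14)}
```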

RoofLine Model for Wave Equation on GPGPU

[Figure: roofline plot. Axes: performance, 10^9 cells/sec (ticks 10 to 1000); localization parameter, cells calculations/(data loads+stores) (ticks 0.1 to 10). Rooflines: TitanZ, GTX 970. Marked: naive; the best of stepwise; CUDA FDTD3d results.]
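The roofline bound behind this plot can be sketched in a few lines: the attainable rate is the smaller of the compute peak and the memory rate multiplied by the localization parameter. The device numbers below are hypothetical placeholders chosen only to make the example concrete. They are not measurements of the TitanZ or the GTX 970.

```python
def attainable_cells_per_sec(localization, mem_items_per_sec, peak_cells_per_sec):
    """Roofline bound: memory-limited slope capped by the compute peak."""
    return min(peak_cells_per_sec, localization * mem_items_per_sec)

# Hypothetical device parameters (assumptions for illustration only).
MEM_ITEMS_PER_SEC = 50e9     # data loads+stores per second
PEAK_CELLS_PER_SEC = 200e9   # cell updates per second at full compute rate

# Localization at which the bound switches from memory to compute.
ridge = PEAK_CELLS_PER_SEC / MEM_ITEMS_PER_SEC

rates = {
    loc: attainable_cells_per_sec(loc, MEM_ITEMS_PER_SEC, PEAK_CELLS_PER_SEC)
    for loc in (0.1, 1.0, 10.0)
}
```

Below the ridge point the rate grows linearly with the localization parameter, so an algorithm that reuses loaded data more often moves to the right along the slope until it reaches the flat compute-bound part.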

LRnLA method

Locality: Take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network
Recursivity: Application of the “divide et impera” strategy to any situation (computer architectures, numerical schemes, etc.)
non-Locality: Optimized for distributed computations
Asynchrony: Adaptable parallel computations on any level

Memory Subsystem Hierarchy for GPGPU and CPU

[Figure: data throughput, B/sec (10^9 to 10^14), versus data set size, B, in three panels: GK110 (GTX Titan), Haswell (Xeon E5 v3), GM204 (GTX 980). GPU memory levels: regs, L1+sh, L2, GDDR5. CPU memory levels: regs, L1, L2, LLC, DDR4, SSD/PCIe, HDD.]

DiamondTile based algorithm construction

Computational grid in x-y and x-t projections

Computational domain is subdivided into diamond-shaped tiles in x-y.

- Diamond encloses cross-shaped stencil
- All elements along 3rd (z) axis are included

Construction steps:

- Choose a DiamondTile on first time-step
- Plot influence cone of first tile
- Choose a shifted DiamondTile on another time-step (Nt steps later)
- Plot dependence cone of last tile
- Find intersection
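The construction steps can be illustrated with a one-dimensional analogue in the x-t projection. This is a sketch under stated assumptions, not the authors' implementation. Tiles are inclusive cell intervals, both cones widen by a half-stencil width `h` per time step, and the later tile is shifted by `nt * h` cells (the slide does not give the shift). All function names are hypothetical.

```python
def influence(base, steps, h):
    """Cells that can be affected by `base` after `steps` time steps."""
    lo, hi = base
    return lo - steps * h, hi + steps * h

def dependence(target, steps, h):
    """Cells that `target` depends on, `steps` time steps earlier."""
    lo, hi = target
    return lo - steps * h, hi + steps * h

def tower_layers(base, nt, h):
    """Intersect the influence cone of the first tile with the
    dependence cone of the shifted tile placed nt steps later."""
    shift = nt * h  # assumed shift: one half-stencil width per time step
    target = (base[0] - shift, base[1] - shift)
    layers = []
    for k in range(nt + 1):
        a = influence(base, k, h)
        b = dependence(target, nt - k, h)
        layers.append((max(a[0], b[0]), min(a[1], b[1])))
    return layers

layers = tower_layers(base=(10, 13), nt=4, h=1)
```

In this analogue the intersection at layer k is the base interval moved k half-stencil widths to the left, so the resulting shape is a slanted tower whose tilt is set by the stencil size.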

DiamondTorre Algorithm shape

Understand Algorithm as a shape

- Stepwise
- Domain decomposition
- More operational intensity
- DiamondTorre

DiamondTorre Algorithm shape

- DiamondTorre tilt depends on stencil size
- Stencil width is determined by order of approximation ($N_O$)

DiamondTorre Algorithm parameters

Performance depends on careful choice of algorithm parameters:

- Size of DiamondTorre base: Diamond Tile Size, DTS
- Quantity of time layers: Nt

Operational Intensity $\sim \mathrm{DTS}/(4 - 1/\mathrm{DTS})$ (for large Nt)
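To see how the estimate scales, the sketch below evaluates DTS/(4 - 1/DTS) for the tile sizes that label the performance plots in this deck (DTS = 1, 4, 7, 14, 20). The function name is illustrative, and the formula is taken as written, valid in the large-Nt limit.

```python
def torre_intensity(dts):
    """Large-Nt operational intensity estimate from the slide."""
    return dts / (4 - 1 / dts)

estimates = {dts: torre_intensity(dts) for dts in (1, 4, 7, 14, 20)}
```

For DTS = 1 the estimate is 1/3, and for large DTS it approaches DTS/4, so the intensity grows almost linearly with the tile size.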


[Figure: roofline plot with DiamondTorre estimates. Axis label: performance, 10^9 cells/sec (ticks 10 to 1000); second axis ticks 0.1 to 10. Rooflines: TitanZ, GTX 970. Marked: naive; the best of stepwise; DiamondTile, DTS=1; DiamondTorre for various DTS (DTS=1, 4, 7, 14, 20).]

DiamondTorre Algorithm with CUDA

In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block.

First stage


Second stage


Odd and even stages are alternating. Synchronization after each stage.

- At first, some portion of cells remain on the first time step, while some are processed to several time layers.
- At the end, all data are progressed to a given time step. This time step is determined by the DiamondTorre height.
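The behaviour described here (cells temporarily sitting on different time layers, yet all arriving at the same final step) can be reproduced with a one-dimensional time-skewed analogue in plain Python. This is a sequential sketch, not the authors' CUDA code. It assumes a second-order stencil, fixed zero boundaries, and slanted towers of width `w` processed left to right instead of the alternating odd and even stages of diamonds. All names are hypothetical.

```python
import math

def update(u, i, k, c2):
    """Advance cell i from time level k to k+1 in place.
    u[p][i] holds the most recent value of cell i at a level of parity p."""
    cur, old = u[k % 2], u[(k + 1) % 2]
    old[i] = 2 * cur[i] - old[i] + c2 * (cur[i + 1] - 2 * cur[i] + cur[i - 1])

def stepwise(u, n, nt, c2):
    """Traditional order: finish a whole time layer before the next one."""
    for k in range(nt):
        for i in range(1, n - 1):
            update(u, i, k, c2)

def towers(u, n, nt, c2, w):
    """Tower order: each tower advances its cells through all nt layers,
    leaning one cell to the left per layer; towers run left to right."""
    for base in range(0, n + nt, w):
        for k in range(nt):
            for i in range(base - k, base + w - k):
                if 1 <= i <= n - 2:  # boundary cells stay fixed
                    update(u, i, k, c2)

def initial(n):
    pulse = [math.exp(-0.05 * (i - n / 2) ** 2) for i in range(n)]
    pulse[0] = pulse[-1] = 0.0
    return [pulse[:], pulse[:]]  # levels 0 and -1 (zero initial velocity)

n, nt, c2, w = 64, 10, 0.25, 4
a, b = initial(n), initial(n)
stepwise(a, n, nt, c2)
towers(b, n, nt, c2, w)
max_diff = max(abs(x - y) for p in range(2) for x, y in zip(a[p], b[p]))
```

After the first few towers, cells on the left are already at level `nt` while cells further right are still at level 0. Once the last tower finishes, both evaluation orders hold the same two time layers.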


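[Figure: roofline plot with DiamondTorre estimates and measured results. Axis label: performance, 10^9 cells/sec (ticks 10 to 1000); second axis ticks 0.1 to 10. Rooflines: TitanZ, GTX 970. Marked: naive; the best of stepwise; DiamondTile, DTS=1; DiamondTorre for various DTS (DTS=1, 4, 7, 14, 20); CUDA FDTD3d results.]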

[Figure: calc rate, Gcells/sec (0 to 60), for various scheme/algorithm parameters, NO/DTS: 2/1, 4/1, 6/1, 8/1, 10/1, 12/1, 14/1, 6/1, 6/2, 4/1, 4/2, 4/3, 2/1, 2/2, 2/3, 2/4, 2/5, 2/6, 2/7. Series: GTX 750Ti, GTX 970, TitanZ (1).]

[Figure: calc rate, Gcells/sec (0.01 to 100), versus parallel level, warps (0.01 to 1000), on logarithmic axes. Series: TitanZ, GTX970. Reference levels: FDTD3d CPU rate, FDTD3d CPU rate with -O3, FDTD3d TitanZ rate, FDTD3d GTX970 rate.]

Wave Modeling Applications

FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)


Gas Dynamics with RKDG scheme (Korneev B.)



FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)


Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)

Main Results and Conclusions

- New DiamondTile algorithms of the LRnLA family are developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPU.
- Unlike the traditional stepwise evaluation order, data dependencies are traced across many time iteration steps. This increases operational intensity and makes higher calculation rates reachable.
- Performance of 50-60 billion cells/s is achieved with Titan, as well as with GTX970, in the implementation of wave modeling.
