DiamondTile Algorithm for High-Performance Wave Modeling
Vadim Levchenko, Anastasia Perepelkina
Keldysh Institute of Applied Mathematics RAS
GTC 2015
-
FLOPs and Bandwidth Performance Ratio
[Figure: memory bandwidth (GB/s, 10 to 1000) against peak performance (TFLOP/s, fp32, 0.1 to 10) for nVidia Maxwell (2014-15), nVidia Kepler (2012-13), Intel CPU (2014), and NEC SX (199x), with guide lines at 4 Bytes/Flops, 0.1 Bytes/Flops, and 0.04 Bytes/Flops.]
-
RoofLine model
S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM, 52:65–76, 2009.
L. Barba, R. Yokota. How Will the Fast Multipole Method Fare in the Exascale Era? CSE13
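As a rough illustration of the roofline bound cited on this slide: attainable performance is capped either by peak compute or by memory bandwidth multiplied by operational intensity. The Python sketch below assumes a made-up device; the function name and the numbers are illustrative, not measurements from the talk.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity_flop_per_byte):
    """Roofline bound: min(peak compute, bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gbs * intensity_flop_per_byte)


# Hypothetical device (illustrative numbers only): 5 TFLOP/s fp32, 200 GB/s.
PEAK_GFLOPS = 5000.0
BANDWIDTH_GBS = 200.0

# Machine balance in Bytes/Flop: a kernel that moves fewer bytes per flop
# than this is limited by compute rather than by memory bandwidth.
balance_bytes_per_flop = BANDWIDTH_GBS / PEAK_GFLOPS

for intensity in (0.5, 5.0, 50.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    print(f"{intensity:5.1f} Flop/byte -> at most {bound:7.1f} GFLOP/s")
```

With these assumed figures, a kernel at 0.5 Flop/byte is bounded by bandwidth far below the compute peak, which is the situation the rest of the talk sets out to escape.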
-
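Wave Modeling Specifics
$$\frac{\partial^2 F}{\partial t^2} = c^2\left(\frac{\partial^2 F}{\partial x^2} + \frac{\partial^2 F}{\partial y^2} + \frac{\partial^2 F}{\partial z^2}\right) \quad (+\,\text{BCs} + \text{ICs})$$
Finite difference along each axis:
$$\left.\frac{\partial^2 F}{\partial x^2}\right|_{x_0,y_0,z_0} = \frac{1}{\Delta x^2}\sum_{i=0}^{N_O/2} C_i \left(F|_{x_0+i\Delta x,\,y_0,\,z_0} + F|_{x_0-i\Delta x,\,y_0,\,z_0}\right)$$
$N_O = 2$ for $\partial^2 F/\partial t^2$; $N_O = 2, 4, 6, \ldots, 14$ for the coordinate axes.
Per one cell, per one time step calculation:
• $O = 1 + 3N_O$ FMA operations
• $D = 3 + 3N_O$ data
Operational intensity: $O/D \sim 1/2$ Flop/byte (naïve algorithm).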
[Figure: cone x² + y² + z² = c²t² drawn in (x, y, t) axes, marking the domain of influence, the domain of dependence, the asynchronous domains, and the synchronization instant.]
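The symmetric finite-difference sum on this slide can be tried out in one dimension. The Python sketch below assumes the standard central-difference weights, written in the slide's convention where the i = 0 term is counted twice; the names `COEFFS` and `second_derivative` are hypothetical, and only NO = 2 and NO = 4 are tabulated.

```python
import math

# Symmetric coefficients C_i for sum_i C_i * (F[x0 + i*dx] + F[x0 - i*dx]).
# Because the i = 0 term appears twice, C_0 is half of the usual center weight.
COEFFS = {
    2: [-1.0, 1.0],
    4: [-5.0 / 4.0, 4.0 / 3.0, -1.0 / 12.0],
}


def second_derivative(f, x0, dx, order):
    """Approximate d2f/dx2 at x0 with a symmetric stencil of the given order (NO)."""
    c = COEFFS[order]
    total = sum(c[i] * (f(x0 + i * dx) + f(x0 - i * dx)) for i in range(order // 2 + 1))
    return total / dx**2


exact = -math.sin(1.0)  # d2/dx2 sin(x) = -sin(x)
err2 = abs(second_derivative(math.sin, 1.0, 0.1, 2) - exact)
err4 = abs(second_derivative(math.sin, 1.0, 0.1, 4) - exact)
print(f"NO=2 error {err2:.2e}, NO=4 error {err4:.2e}")
```

Raising NO widens the stencil by one point on each side per two orders, which is why the data count per cell grows with NO in the naive algorithm.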
-
Wave Modeling Specifics
Cross-shaped stencil fits into diamond shape.
-
Wave equation modelling
Computational grid projection to (x–t)
-
Traditional stepwise evaluation order
Overlapping stencils increase operational intensity:
• $O = 1 + 3N_O$ FMA operations
• $D = 3$ data
Operational intensity: $O/D \sim (1 + N_O)$ Flop/byte
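To see how much the reuse of overlapping stencils helps, the operation and data counts from the slides can be compared directly. The Python sketch below reports the ratio in FMA operations per data element rather than in Flop/byte, since the conversion depends on how flops and data widths are counted, which the slides do not spell out; the function name is hypothetical.

```python
def fma_per_data(order, overlapping):
    """FMA operations per data element for one cell update, using the slide's counts."""
    operations = 1 + 3 * order                      # O = 1 + 3*NO
    data = 3 if overlapping else 3 + 3 * order      # D = 3 (reuse) or 3 + 3*NO (naive)
    return operations / data


for order in (2, 4, 8, 14):
    naive = fma_per_data(order, overlapping=False)
    reuse = fma_per_data(order, overlapping=True)
    print(f"NO={order:2d}: naive {naive:.2f}, overlapping {reuse:.2f} FMA per data element")
```

The naive ratio stays below one for every NO, while the overlapping ratio grows linearly with NO.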
-
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, 10 to 1000. Labels: "the best of stepwise", "naive", "CUDA FDTD3d results"; devices: TitanZ, GTX 970.]
-
LRnLA method
Locality: Take advantage of the memory subsystem hierarchy, from on-chip CPU cache up to disk and network.
Recursivity: Application of the "divide et impera" strategy in any situation (computer architectures, numerical schemes, etc.).
non-Locality: Optimized for distributed computations.
Asynchrony: Adaptable parallel computations on any level.
-
Memory Subsystem Hierarchy for GPGPU and CPU
[Figure: data throughput (B/sec, 10^9 to 10^14) against data set size (B) for three devices: GK110 (GTX Titan), Haswell (Xeon E5 v3), and GM204 (GTX 980). GPU levels: regs, L1+sh, L2, GDDR5. CPU levels: regs, L1, L2, LLC, DDR4, SSD/PCIe, HDD.]
-
DiamondTile based algorithm construction
Computational grid in x-y and x-t projections
-
DiamondTile based algorithm construction
Computational domain is subdivided into diamond-shaped tiles in x-y.
• Diamond encloses cross-shaped stencil
• All elements along the 3rd (z) axis are included
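The claim that a diamond encloses the cross-shaped stencil can be checked by counting lattice points. The Python sketch below treats the diamond as the set of offsets with |dx| + |dy| <= NO/2, which is an assumption about the tile geometry made for illustration; both function names are hypothetical.

```python
def cross_stencil(order):
    """Offsets (dx, dy) touched by a cross-shaped stencil of order NO in the x-y plane."""
    r = order // 2
    points = {(0, 0)}
    for i in range(1, r + 1):
        points |= {(i, 0), (-i, 0), (0, i), (0, -i)}
    return points


def diamond(radius):
    """Lattice points with |dx| + |dy| <= radius."""
    return {(dx, dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if abs(dx) + abs(dy) <= radius}


for order in (2, 4, 6):
    r = order // 2
    cross, tile = cross_stencil(order), diamond(r)
    square = (2 * r + 1) ** 2
    print(f"NO={order}: cross {len(cross)} pts, diamond {len(tile)} pts, square {square} pts")
```

The diamond covers every stencil point while containing fewer cells than the bounding square of the same reach, so less data sits in a tile that the stencil never touches.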
-
DiamondTile based algorithm construction
• Choose a DiamondTile on the first time step
• Plot the influence cone of the first tile
• Choose a shifted DiamondTile on another time step (Nt steps later)
• Plot the dependence cone of the last tile
• Find the intersection
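The five construction steps above can be followed in the x-t projection with plain interval arithmetic. The Python sketch below assumes that both cones widen by r = NO/2 cells per time step and that the last tile is shifted by r*Nt; the slides do not state the shift, so that value, like the function name, is an assumption of this sketch.

```python
def tower_cross_section(a, b, r, nt, k):
    """x-interval [lo, hi] of the construction at time step k (0 <= k <= nt).

    [a, b] is the first tile, r = NO/2 is how far the stencil reaches per step,
    and the last tile is assumed to be shifted by r*nt (an assumption of this sketch).
    """
    shift = r * nt
    # Influence cone of the first tile widens by r per step going forward in time.
    inf_lo, inf_hi = a - r * k, b + r * k
    # Dependence cone of the last tile widens by r per step going backward in time.
    dep_lo, dep_hi = a + shift - r * (nt - k), b + shift + r * (nt - k)
    return max(inf_lo, dep_lo), min(inf_hi, dep_hi)


A, B, R, NT = 0, 7, 1, 5   # hypothetical tile [0, 7], NO = 2, five time steps
for k in range(NT + 1):
    print(k, tower_cross_section(A, B, R, NT, k))
```

Under these assumptions the intersection keeps the width of the tile at every step and slides sideways by r cells per step, which gives a tilted tower whose slope is set by the stencil size.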
-
DiamondTorre Algorithm shape
-
Understand Algorithm as a shape
Stepwise
-
Understand Algorithm as a shape
Domain decomposition
-
Understand Algorithm as a shape
More operational intensity
-
Understand Algorithm as a shape
DiamondTorre
-
DiamondTorre Algorithm shape
• DiamondTorre tilt depends on stencil size
• Stencil width is determined by the order of approximation (NO)
-
DiamondTorre Algorithm parameters
Performance depends on careful choice of algorithm parameters:
• Size of DiamondTorre base: Diamond Tile Size, DTS
• Quantity of time layers: Nt
Operational Intensity ∼ DTS/(4-1/DTS) (for large Nt)
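The large-Nt estimate can be tabulated for the tile sizes that appear on the roofline plots in this talk. This Python sketch only evaluates the formula from the slide, in whatever units the slide intends; the function name is hypothetical.

```python
def operational_intensity(dts):
    """Large-Nt estimate from the slide: DTS / (4 - 1/DTS)."""
    return dts / (4.0 - 1.0 / dts)


for dts in (1, 4, 7, 14, 20):
    print(f"DTS={dts:2d}: intensity ~ {operational_intensity(dts):.2f}")
```

For large DTS the estimate approaches DTS/4, so doubling the tile size roughly doubles the intensity.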
-
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, 10 to 1000. Labels: "DiamondTile" at DTS = 1, 4, 7, 14, 20; "DiamondTorre for various DTS"; "the best of stepwise"; "DTS=1"; "naive"; devices: TitanZ, GTX 970.]
-
DiamondTorre Algorithm with CUDA
In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block.
First stage
-
DiamondTorre Algorithm with CUDA
In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block.
Second stage
-
DiamondTorre Algorithm with CUDA
In each tile, Nz elements (along the 3rd (z) axis) are processed asynchronously by CUDA threads in a block.
Odd and even stages alternate, with synchronization after each stage.
-
DiamondTorre Algorithm with CUDA
• At first, some portion of cells remain on the first time step, while some are processed to several time layers
-
DiamondTorre Algorithm with CUDA
• At the end, all data are progressed to a given time step. This time step is determined by the DiamondTorre height
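The idea that cells advance unevenly at first and still all arrive at the same final time step can be imitated in one dimension. The Python sketch below is not the authors' CUDA implementation: it is a 1D analog with NO = 2, where tilted towers of width `W` are processed right to left and compared with the stepwise order. Grid size, tower width, initial pulse, and all names are assumptions. Unfilled cells are NaN, so any read of a value that has not been computed yet would show up in the result.

```python
import numpy as np

N, NT, W = 64, 12, 8      # cells, time levels, tower width (all hypothetical)
C2 = 0.25                 # (c*dt/dx)^2, chosen inside the stability limit
assert W >= 2             # a tower may only depend on itself and its right neighbor


def update(hist, k, lo, hi):
    """Advance cells lo..hi-1 (padded indices) to time level k with the NO = 2 scheme."""
    prev, prev2 = hist[k - 1], hist[k - 2]
    lap = prev[lo + 1:hi + 1] - 2.0 * prev[lo:hi] + prev[lo - 1:hi - 1]
    hist[k, lo:hi] = 2.0 * prev[lo:hi] - prev2[lo:hi] + C2 * lap


def initial_history():
    """All time levels; NaN marks 'not computed yet', ghost columns stay zero."""
    hist = np.full((NT + 1, N + 2), np.nan)
    hist[:, 0] = hist[:, -1] = 0.0
    x = np.arange(N)
    hist[0, 1:-1] = hist[1, 1:-1] = np.exp(-0.05 * (x - N / 2) ** 2)
    return hist


# Reference: traditional stepwise order, one full time level after another.
stepwise = initial_history()
for k in range(2, NT + 1):
    update(stepwise, k, 1, N + 1)

# Tower order: tower j holds cells j*W + k <= x < j*W + W + k at level k
# (tilt of one cell per step, matching NO/2 = 1), processed right to left.
towers = initial_history()
for j in range((N - 1) // W, (-NT) // W - 1, -1):
    for k in range(2, NT + 1):
        lo, hi = max(j * W + k, 0) + 1, min(j * W + W + k, N) + 1
        if lo < hi:
            update(towers, k, lo, hi)

print("max difference:", np.abs(towers[NT] - stepwise[NT]).max())
```

Each tower runs through all its time levels before the next tower starts, so a tower's data can stay in fast memory for NT updates instead of being reloaded at every step. In the two-dimensional algorithm of the slides, the alternating odd and even stages take the place of the right-to-left ordering used here.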
-
RoofLine Model for Wave Equation on GPGPU
[Figure: roofline plot. Horizontal axis: localization parameter, cell calculations/(data loads + stores), 0.1 to 10. Vertical axis: performance, 10^9 cells/sec, 10 to 1000. Labels: "DiamondTile" at DTS = 1, 4, 7, 14, 20; "DiamondTorre for various DTS"; "the best of stepwise"; "DTS=1"; "naive"; "CUDA FDTD3d results"; devices: TitanZ, GTX 970.]
-
[Figure: calculation rate (Gcells/sec, 0 to 60) for various scheme/algorithm parameters NO/DTS (2/1, 4/1, 6/1, 8/1, 10/1, 12/1, 14/1, 6/1, 6/2, 4/1, 4/2, 4/3, 2/1, 2/2, 2/3, 2/4, 2/5, 2/6, 2/7) on GTX 750Ti, GTX 970, and TitanZ (1).]
-
[Figure: calculation rate (Gcells/sec, 0.01 to 100) against parallel level (warps, 0.01 to 1000) for TitanZ and GTX970, with reference levels: FDTD3d CPU rate, FDTD3d CPU rate with -O3, FDTD3d TitanZ rate, FDTD3d GTX970 rate.]
-
Wave Modeling Applications
FDTD simulation for electromagnetics (2nd and 4th order approximation, PML) (Zakirov A., Goryachev I.)
-
Wave Modeling Applications
Gas Dynamics with RKDG scheme (Korneev B.)
-
Wave Modeling Applications
FDTD simulation for elastic seismic media (Levander scheme, 4th order, PML, Thomsen anisotropy, TFSF source) (Levchenko V., Zakirov A., Perepelkina A., Ivanov A.)
-
Wave Modeling Applications
Particle-in-cell plasma kinetics (Levchenko V., Perepelkina A., Goryachev I.)
-
Main Results and Conclusions
• New DiamondTile algorithms of the LRnLA family are developed for wave modeling. The algorithms are efficient on the memory and parallelism models of CUDA GPGPU.
• Unlike the traditional stepwise evaluation order, data dependencies are traced across many time iteration steps. This increases operational intensity and allows higher calculation rates to be reached.
• Performance of 50-60 billion cells/s is achieved with Titan, as well as with GTX970, in the implementation of wave modeling.