TRANSCRIPT
Maxeler Dataflow Computing Workshop
Introduction to Dataflow Computing
STFC Hartree Centre, June 2013
Programmable Spectrum
A spectrum of architectures, spanning increasing parallelism (#cores) in one direction and increasing core complexity in the other:
• Single-core CPU (Intel, AMD)
• Multi-core (Intel, AMD)
• Many-cores / GPU (NVIDIA, AMD; e.g. GK110)
• Several-cores (Tilera, XMOS etc.)
• Dataflow (Maxeler)
The first four are control-flow processors; Maxeler builds dataflow processors. Hybrids also exist, e.g. AMD Fusion and IBM Cell.
Acceleration Potential
• Ten times slower clock (-10×)
• Degrees of freedom: architecture, data type
• Massive parallelism (+100×): bit level, pipeline level, architecture level, system level
[Figure: processor performance, control flow versus dataflow, with +100× from parallelism against -10× from clock speed]
Where is silicon used?
[Die comparison: on the Intel 6-core X5680 "Westmere", only a fraction of the chip area performs computation; on a dataflow processor, nearly the whole chip is computation (dataflow cores), managed by MaxelerOS]
Explaining Control Flow versus Data Flow
Analogy: the Ford production line
• Experts are expensive and slow (control flow)
• Many specialized workers are more efficient (data flow)
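The analogy maps onto code: below is a purely illustrative plain-Python sketch (a dataflow engine is configured hardware, not software, so this is only a caricature) contrasting one "expert" that performs every step for each item with a production line of single-step specialized workers.

```python
# Illustrative caricature of the production-line analogy; not Maxeler code.

def expert(items):
    # Control-flow style: a single general-purpose worker does every step.
    results = []
    for x in items:
        y = x * 2      # step 1
        y = y + 3      # step 2
        y = y * y      # step 3
        results.append(y)
    return results

def double(stream):        # specialized worker 1
    for x in stream:
        yield x * 2

def add3(stream):          # specialized worker 2
    for x in stream:
        yield x + 3

def square(stream):        # specialized worker 3
    for x in stream:
        yield x * x

def production_line(items):
    # Dataflow style: chain the stages; data streams through them.
    return list(square(add3(double(items))))

print(expert([1, 2, 3]))           # [25, 49, 81]
print(production_line([1, 2, 3]))  # [25, 49, 81]
```

Both produce the same answers; the difference is that in the second version each worker only ever does one operation, which is what makes the hardware analogue efficient.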
On-chip resources
• Each application has a different configuration of dataflow cores
• Dataflow cores are built out of the basic operating resources on-chip: DSP resources, RAM resources (10TB/s), general logic resources
Maxeler Hardware Solutions
• CPUs plus DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low-latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London

MPC-C500
• 1U form factor • 4× dataflow engines • 12 Intel Xeon cores • 96GB DFE RAM • up to 192GB CPU RAM • MaxRing interconnect • 3× 3.5" hard drives • Infiniband

MPC-X1000
• 8 dataflow engines (192-384GB RAM) • high-speed MaxRing • zero-copy RDMA between CPUs and DFEs over Infiniband • dynamic CPU/DFE balancing
Application Examples
• Finite difference modeling
• Reverse time migration
• CRS stacking
• Sparse matrix solving
• Credit derivatives pricing
3D Finite Difference Modeling (T. Nemeth et al., 2008)
• Geophysical model: 3D acoustic wave equation; variable velocity and density; isotropic medium
• Numerical model: finite differences (12th-order convolution), 4th order in time; point source, absorbing boundary conditions
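To make the time-stepping scheme concrete, here is a minimal 1D sketch of acoustic finite-difference propagation in plain Python. It is a simplification under stated assumptions: constant velocity, 4th-order in space and 2nd-order in time (the work above is 3D, 12th-order in space, 4th-order in time, with absorbing boundaries), and all grid constants are made up.

```python
# Minimal 1D acoustic FD sketch: u_tt = v^2 u_xx, leapfrog in time,
# 4th-order central stencil in space. Purely illustrative.

NX, NT = 200, 300
DX, DT, V = 5.0, 0.0005, 1500.0   # spacing (m), time step (s), velocity (m/s)
# CFL number V*DT/DX = 0.15, comfortably stable for this stencil.

# 4th-order central-difference coefficients for the second derivative.
C = [-1.0 / 12.0, 4.0 / 3.0, -5.0 / 2.0, 4.0 / 3.0, -1.0 / 12.0]

prev = [0.0] * NX
curr = [0.0] * NX
curr[NX // 2] = 1.0               # point-source initial impulse

for _ in range(NT):
    nxt = [0.0] * NX
    for i in range(2, NX - 2):
        lap = sum(C[j + 2] * curr[i + j] for j in range(-2, 3)) / DX**2
        nxt[i] = 2.0 * curr[i] - prev[i] + (V * DT) ** 2 * lap
    prev, curr = curr, nxt

# The wavefield is nonzero after propagating away from the source.
print(max(abs(u) for u in curr))
```

On a DFE the inner stencil sum becomes a deep arithmetic pipeline, which is why speedup grows with domain size.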
FD Implementation Options
• Option 1: uni-axial
• Option 2: 23-point tri-axial
• Option 3: 11-point tri-axial
• Option 4: composite uni-axial
Modeling Results
• Up to 240× speedup for 1 MAX2 card compared to a single CPU core
• Speedup increases with cube size
• 1-billion-point modeling domain using a single card
[Figure: FD modeling performance — speedup compared to a single core (0-300) versus domain size n³ (0-1000)]
Reverse Time Migration (W. Liu et al., 2009)
• Accelerated RTM uses the dataflow-accelerated FD modeling propagator
• Speedup depends on the RTM scheme: diskless schemes allow the full DFE performance to be exploited
[Figures: accelerated 3D VTI RTM image on the Hess model, (a) CPU versus (b) DFE; relationship between forward-modeling speedup and RTM speedup for different RTM schemes]
CRS Trace Stacking (P. Marchetti et al., 2010)
• Velocity-independent / data-driven method to obtain a stack, based on 8 parameters; search for every sample of each output trace:
  – 2 parameters: emergence angle and azimuth
  – 3 normal wavefront parameters (K_N,11; K_N,12; K_N,22)
  – 3 NIP wavefront parameters (K_NIP,11; K_NIP,12; K_NIP,22)
• CRS travel-time formula:

  t_hyp^2 = (t_0 + w^T m)^2 + (2 t_0 / v_0) (m^T H_zy K_N H_zy^T m + h^T H_zy K_NIP H_zy^T h)
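Evaluating this travel-time formula for one candidate parameter set can be sketched in plain Python. All numeric values below are invented for illustration, and the helper names are my own; only the formula itself comes from the slide.

```python
# Hedged sketch: evaluate the 3D CRS travel-time formula
#   t_hyp^2 = (t0 + w^T m)^2
#           + (2 t0 / v0) (m^T H K_N H^T m + h^T H K_NIP H^T h)
# for one candidate parameter set. 2x2 matrices as nested lists.

def matvec(A, v):
    return [sum(A[i][j] * v[j] for j in range(2)) for i in range(2)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def transpose(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def quad_form(H, K, v):
    # v^T H K H^T v
    Htv = matvec(transpose(H), v)
    return dot(Htv, matvec(K, Htv))

def t_hyp(t0, v0, w, m, h, H, K_N, K_NIP):
    a = t0 + dot(w, m)
    b = (2.0 * t0 / v0) * (quad_form(H, K_N, m) + quad_form(H, K_NIP, h))
    return (a * a + b) ** 0.5

# Hypothetical values, for illustration only.
t0, v0 = 1.2, 2000.0
w = [1e-4, 2e-4]                       # from emergence angle and azimuth
H = [[1.0, 0.0], [0.0, 1.0]]           # projection matrix H_zy
K_N = [[1e-5, 0.0], [0.0, 1e-5]]       # normal wavefront curvatures
K_NIP = [[5e-5, 0.0], [0.0, 5e-5]]     # NIP wavefront curvatures
m, h = [100.0, 50.0], [200.0, 0.0]     # midpoint and half-offset vectors
print(t_hyp(t0, v0, w, m, h, H, K_N, K_NIP))
```

The 8-parameter search repeats exactly this evaluation across the whole parameter space, which is what the DFE parallelizes.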
3D CRS
• Search in the 8-dimensional parameter space, and evaluate each result by calculating semblance
• t_i comes from the CRS travel-time formula:

  t_hyp^2 = (t_0 + w^T m)^2 + (2 t_0 / v_0) (m^T H_zy K_N H_zy^T m + h^T H_zy K_NIP H_zy^T h)

• Semblance over M traces and a window of N+1 samples:

  S(x_0, t_0) = [ Σ_{k=-N/2}^{N/2} ( Σ_{i=1}^{M} a_{i, t_i+k} )^2 ] / [ M Σ_{k=-N/2}^{N/2} Σ_{i=1}^{M} a_{i, t_i+k}^2 ]
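The semblance sum translates directly into plain Python. The traces below are synthetic, and the indexing convention (t_i as a precomputed sample index per trace) is an assumption made for the sketch.

```python
# Semblance S = [sum_k (sum_i a_{i,t_i+k})^2] / [M sum_k sum_i a_{i,t_i+k}^2]
# over M traces and a window k = -half_window .. +half_window.

def semblance(traces, times, half_window):
    # traces: list of M sample lists; times: predicted sample index t_i
    # per trace (assumed precomputed from the travel-time formula).
    M = len(traces)
    num = 0.0
    den = 0.0
    for k in range(-half_window, half_window + 1):
        s = 0.0
        for i in range(M):
            a = traces[i][times[i] + k]
            s += a
            den += a * a
        num += s * s
    return num / (M * den) if den > 0.0 else 0.0

# Perfectly coherent synthetic traces give semblance 1.0.
flat = [[1.0] * 10 for _ in range(5)]
print(semblance(flat, [5] * 5, 2))  # 1.0
```

Semblance peaks at 1.0 when the predicted travel times line up identical events across all traces, which is why it scores each candidate parameter set in the search.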
CRS Application Analysis
• Runtime is dominated by the travel-time and semblance calculations: semblance 91.42%, travel time 7.66%, coherency 0.87%, Hilbert 0.01%
• CPU: compute samples in series; DFE: compute multiple samples in parallel (1, 16 or 64 t_0 samples at a time)

CRS Results
• Performance of one MAX2 card vs. 1 CPU core: land case (8 parameters), speedup of 230×; marine case (6 parameters), speedup of 190×
[Figure: CPU coherency versus MAX2 coherency]
Sparse Matrix Solving (O. Lindtjorn et al., 2010)
• Sparse matrices are used in a variety of important applications
• Matrix solving: given matrix A and vector b, find vector x in Ax = b
• Direct or iterative solvers; structured vs. unstructured matrices
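As a concrete instance of an iterative solve of Ax = b, here is a minimal Jacobi sketch in plain Python over a dictionary-based sparse matrix. This is a generic textbook method chosen for brevity, not the solver used in the cited work.

```python
# Jacobi iteration for Ax = b on a dictionary-of-rows sparse matrix.
# Converges for diagonally dominant matrices; purely illustrative.

def jacobi(A, b, iters=100):
    # A: {row: {col: value}} with nonzero diagonal entries; b: dense RHS.
    n = len(b)
    x = [0.0] * n
    for _ in range(iters):
        x_new = []
        for i in range(n):
            # Sum of off-diagonal contributions using the current iterate.
            s = sum(v * x[j] for j, v in A[i].items() if j != i)
            x_new.append((b[i] - s) / A[i][i])
        x = x_new
    return x

# Diagonally dominant 3x3 example with exact solution x = [1, 1, 1].
A = {0: {0: 4.0, 1: 1.0},
     1: {0: 1.0, 1: 3.0, 2: 1.0},
     2: {1: 1.0, 2: 5.0}}
b = [5.0, 5.0, 6.0]
print([round(v, 3) for v in jacobi(A, b)])  # [1.0, 1.0, 1.0]
```

Each iteration is dominated by a sparse matrix-vector product; the memory-bandwidth bound on that product is what the scalability and compression discussion below is about.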
Typical Scalability of Sparse Matrix
[Figures: relative speed versus number of cores — Visage geomechanics FEM benchmark (2-node Nehalem 2.93 GHz) and Eclipse E300 2-Mcell benchmark (2-node Westmere 3.06 GHz)]
Sparse Matrix in DFE
• Speedup is 20×-40× per 1U node at 200MHz
[Figure: speedup per 1U node versus compression ratio for the matrices GREE0A1new01 and 624, using Maxeler domain-specific address and data encoding]
Credit Derivatives Valuation & Risk (O. Mencer and S. Weston, 2010)
• Compute the value of complex financial derivatives (CDOs)
• Typically run overnight, but beneficial to compute in real time
• Many independent jobs

Credit Derivatives Results
• Calculation of current value and credit spread risk for a population of 2,925 bespoke tranches
• Speedup from 1 MAX2 card: 219-270× compared to 1 core; ~30× compared to an 8-core node
• Power consumption per node drops from 250W to 235W with acceleration
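These figures imply a large energy-per-job saving, which a back-of-envelope check makes explicit. The ~30× node-level speedup and the two power numbers are from the slides; the ratio itself is computed here.

```python
# Energy per job = power * run time, so the improvement factor is
# speedup * (P_before / P_after). Inputs are the slide's figures.

speedup_node = 30.0                   # ~30x vs an 8-core node
power_cpu, power_dfe = 250.0, 235.0   # watts per node, before/after

energy_ratio = speedup_node * (power_cpu / power_dfe)
print(round(energy_ratio, 1))  # 31.9
```

In other words, the accelerated node finishes ~30× sooner at slightly lower power, cutting energy per job by roughly 32×.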