

Maxeler Dataflow Computing Workshop

Introduction to Dataflow Computing

STFC Hartree Centre, June 2013

Programmable Spectrum

[Diagram: the spectrum of programmable processors, from control-flow processors to the dataflow processor. Parallelism (#cores) increases and core complexity decreases from left to right.]

• Single-core CPU: Intel, AMD
• Multi-core: Intel, AMD
• Several-cores / many-cores: GPU (NVIDIA GK110, AMD), Tilera, XMOS, etc.
• Dataflow: Maxeler
• Hybrid: e.g. AMD Fusion, IBM Cell

Acceleration Potential

• Ten times slower clock (-10x)
• Degrees of freedom:
– Architecture
– Data type
• Massive parallelism (+100x):
– Bit level
– Pipeline level
– Architecture level
– System level

[Chart: processor performance, control flow vs. dataflow: roughly -10x from clock rate, +100x from parallelism.]

Where is the silicon used?

[Die photo: Intel 6-Core X5680 "Westmere", with the area devoted to computation highlighted.]

Dataflow Processor

[Diagram: a dataflow processor die, consisting almost entirely of computation (dataflow cores), managed by a thin MaxelerOS layer.]

Explaining Control Flow versus Data Flow

[Diagrams: control-flow microprocessor (CPU) vs. Dataflow Engine (DFE).]

Analogy: The Ford Production Line

• Experts are expensive and slow (control flow)
• Many specialized workers are more efficient (data flow)
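To make the contrast concrete, here is a minimal kernel sketch in MaxJ, the Java-based language used by Maxeler's MaxCompiler. The kernel name and exact package paths are assumptions (they vary by MaxCompiler version); MaxJ extends Java with operator overloading on DFEVar, so the arithmetic below describes hardware, not a sequence of instructions.

```java
// Hypothetical MaxJ kernel sketch (package paths assumed, MaxCompiler v2 era).
// On a CPU this computation would be a loop executing one instruction at a
// time; on a DFE the expression below becomes a fixed pipeline of arithmetic
// units that the data stream flows through, one value per clock cycle.
import com.maxeler.maxcompiler.v2.kernelcompiler.Kernel;
import com.maxeler.maxcompiler.v2.kernelcompiler.KernelParameters;
import com.maxeler.maxcompiler.v2.kernelcompiler.types.base.DFEVar;

class SquarePlusOneKernel extends Kernel {
    SquarePlusOneKernel(KernelParameters params) {
        super(params);
        DFEVar x = io.input("x", dfeFloat(8, 24)); // stream in, IEEE single
        DFEVar y = x * x + 1;                      // one multiplier, one adder
        io.output("y", y, dfeFloat(8, 24));        // stream out
    }
}
```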

On-Chip Resources

• Each application has a different configuration of dataflow cores
• Dataflow cores are built out of the basic operating resources on-chip:
– DSP resources
– RAM resources (10TB/s)
– General logic resources

Maxeler Hardware Solutions

• CPUs plus DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low-latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London

MPC-C500

• 1U form factor
• 4x dataflow engines
• 12 Intel Xeon cores
• 96GB DFE RAM
• Up to 192GB CPU RAM
• MaxRing interconnect
• 3x 3.5" hard drives
• Infiniband

MPC-X1000

• 8 dataflow engines (192-384GB RAM)
• High-speed MaxRing
• Zero-copy RDMA between CPUs and DFEs over Infiniband
• Dynamic CPU/DFE balancing

Application Examples

• Finite Difference Modeling
• Reverse Time Migration
• CRS stacking
• Sparse Matrix Solving
• Credit Derivatives Pricing

3D Finite Difference Modeling (T. Nemeth et al., 2008)

• Geophysical model:
– 3D acoustic wave equation
– Variable velocity and density
– Isotropic medium
• Numerical model (see the sketch below):
– Finite differences (12th-order convolution)
– 4th order in time
– Point source, absorbing boundary conditions
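As a point of reference for what the DFE pipelines, here is a minimal CPU-style sketch of the finite-difference update on a 1D slice; the slide's scheme is 12th order in space along all three axes and 4th order in time, while this sketch uses 2nd order in time for brevity. The coefficient array c is a placeholder, not the actual operator weights.

```java
// One time step of an acoustic FD propagator on a 1D slice.
// 12th-order spatial convolution => stencil radius 6.
static void fdStep(float[] prev, float[] cur, float[] next,
                   float[] v2dt2, float[] c) {
    final int R = 6;
    for (int i = R; i < cur.length - R; i++) {
        float lap = c[0] * cur[i];
        for (int k = 1; k <= R; k++) {
            lap += c[k] * (cur[i - k] + cur[i + k]);   // symmetric stencil
        }
        // 2nd-order time update: u_next = 2u - u_prev + v^2 dt^2 * Laplacian
        next[i] = 2f * cur[i] - prev[i] + v2dt2[i] * lap;
    }
}
```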

FD Implementation Options

[Stencil diagrams:]
• Option 1: uni-axial
• Option 2: 23-point tri-axial
• Option 3: 11-point tri-axial
• Option 4: composite uni-axial

Modeling Results

• Up to 240x speedup for 1 MAX2 card compared to a single CPU core
• Speedup increases with cube size
• 1 billion point modeling domain using a single card

[Plot: FD Modeling Performance; speedup compared to a single core (0-300x) vs. domain size n^3 (0-1000).]

Reverse Time Migration (W. Liu et al., 2009)

• Accelerated RTM uses the dataflow-accelerated FD modeling propagator
• Speedup depends on the RTM scheme: diskless schemes allow the full DFE performance to be exploited (see the sketch below)

[Figures: accelerated 3D VTI RTM image on the Hess model, (a) CPU vs. (b) DFE; relationship between forward modeling speedup and RTM speedup for different RTM schemes.]
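For orientation, RTM forward-propagates the source wavefield, back-propagates the recorded data, and cross-correlates the two (the imaging condition). A schematic plain-Java sketch is below; how the source wavefield is stored or recomputed is exactly where disk-based and diskless schemes differ, and the names here are illustrative only.

```java
// Schematic RTM imaging condition: image[x] += src(t, x) * rcv(t, x),
// summed over time. Diskless schemes recompute or checkpoint the source
// wavefield instead of writing every time step to disk, so the DFE
// propagator is never stalled waiting on I/O.
static void imagingCondition(float[][] srcWavefield, float[][] rcvWavefield,
                             float[] image, int nt) {
    for (int t = 0; t < nt; t++) {
        for (int x = 0; x < image.length; x++) {
            image[x] += srcWavefield[t][x] * rcvWavefield[t][x];
        }
    }
}
```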

CRS Trace Stacking (P. Marchetti et al., 2010)

• Velocity-independent / data-driven method to obtain a stack, based on 8 parameters; the search is run for every sample of each output trace:
– 2 parameters: emergence angle and azimuth
– 3 normal wavefront parameters: K_N,11, K_N,12, K_N,22
– 3 NIP wavefront parameters: K_NIP,11, K_NIP,12, K_NIP,22

The hyperbolic CRS travel-time formula:

$$ t_{\mathrm{hyp}}^2 = \left(t_0 + \frac{2}{v_0}\,\mathbf{w}^T\mathbf{m}\right)^2 + \frac{2t_0}{v_0}\left(\mathbf{m}^T\mathbf{H}_{zy}\mathbf{K}_N\mathbf{H}_{zy}^T\,\mathbf{m} + \mathbf{h}^T\mathbf{H}_{zy}\mathbf{K}_{NIP}\mathbf{H}_{zy}^T\,\mathbf{h}\right) $$
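A direct scalar transcription of the formula might look like the sketch below. The method names and the precomputed matrices A = H_zy K_N H_zy^T and B = H_zy K_NIP H_zy^T are mine, not from the slide.

```java
// CRS travel-time for midpoint displacement m and half-offset h (2-vectors).
// A and B are precomputed symmetric 2x2 matrices; w is the 2-vector derived
// from the emergence angle and azimuth parameters.
static double tHyp(double t0, double v0, double[] w,
                   double[] m, double[] h, double[][] A, double[][] B) {
    double lin = t0 + (2.0 / v0) * (w[0] * m[0] + w[1] * m[1]);
    double quad = (2.0 * t0 / v0) * (quadForm(A, m) + quadForm(B, h));
    return Math.sqrt(lin * lin + quad);
}

static double quadForm(double[][] M, double[] v) {   // v^T M v
    return v[0] * (M[0][0] * v[0] + M[0][1] * v[1])
         + v[1] * (M[1][0] * v[0] + M[1][1] * v[1]);
}
```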

3D CRS

• Search the 8-dimensional parameter space, and evaluate each result by calculating semblance:

$$ S(x_0, t_0) = \frac{\sum_{k=-N}^{N}\left(\sum_{i=1}^{M} a_{i,\,t_i+k}\right)^2}{M\,\sum_{k=-N}^{N}\sum_{i=1}^{M} a_{i,\,t_i+k}^2} $$

• t_i comes from the CRS travel-time formula given above
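In code, the semblance evaluation is a pair of nested reductions. A plain-Java sketch under the same notation (a[i][t] is the amplitude of trace i at sample t, ti[i] the predicted travel-time in samples) is:

```java
// Semblance of M traces along a moveout curve, over a window of 2N+1 samples.
// Returns a coherency measure in [0, 1]; 1 means perfectly coherent traces.
static float semblance(float[][] a, int[] ti, int N) {
    int M = a.length;
    float num = 0f, den = 0f;
    for (int k = -N; k <= N; k++) {
        float sum = 0f;
        for (int i = 0; i < M; i++) {
            float v = a[i][ti[i] + k];
            sum += v;                  // coherent (stacked) amplitude
            den += v * v;              // total energy
        }
        num += sum * sum;
    }
    return den == 0f ? 0f : num / (M * den);
}
```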

CRS Application Analysis

• Runtime is dominated by the travel-time and semblance calculations
• CPU: compute samples in series; DFE: compute multiple samples in parallel (1, 16, or 64 t0 samples)

[Pie chart of runtime: Semblance 91.42%, Traveltime 7.66%, Coherency 0.87%, Hilbert 0.01%.]

CRS Results

• Performance of one MAX2 card vs. 1 CPU core:
– Land case (8 parameters): speedup of 230x
– Marine case (6 parameters): speedup of 190x

[Images: CPU coherency vs. MAX2 coherency.]

Sparse Matrix Solving (O. Lindtjorn et al., 2010)

• Sparse matrices are used in a variety of important applications
• Matrix solving: given matrix A and vector b, find vector x in Ax = b
• Direct or iterative solvers
• Structured vs. unstructured matrices
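The workhorse of an iterative solver is the sparse matrix-vector product. A plain-Java sketch using the standard compressed sparse row (CSR) format is below; note that the DFE implementation discussed on the following slides uses Maxeler's own domain-specific encoding rather than CSR.

```java
// y = A*x with A in compressed sparse row (CSR) form: rowPtr[r]..rowPtr[r+1]
// delimits the nonzeros of row r, stored as (colIdx[j], val[j]) pairs.
static void spmv(int n, int[] rowPtr, int[] colIdx, double[] val,
                 double[] x, double[] y) {
    for (int r = 0; r < n; r++) {
        double acc = 0.0;
        for (int j = rowPtr[r]; j < rowPtr[r + 1]; j++) {
            acc += val[j] * x[colIdx[j]];   // irregular, data-dependent access
        }
        y[r] = acc;
    }
}
```

The data-dependent access to x is why sparse solvers scale poorly on cache-based CPUs, as the next slide shows.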

Typical Scalability of Sparse Matrix

[Plots: relative speed vs. # cores. Eclipse E300 2-Mcell benchmark (2-node Westmere 3.06 GHz), 0-12 cores, relative speed 0-4; Visage geomechanics FEM benchmark (2-node Nehalem 2.93 GHz), 0-8 cores, relative speed 0-5.]

Sparse Matrix in DFE

[Plot: speedup per 1U node (0-60x) vs. compression ratio (0-10) for two matrices, GREE0A1new01 and 624, using Maxeler domain-specific address and data encoding.]

• Speedup is 20x-40x per 1U node at 200MHz
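The slide does not disclose the encoding itself, but the general idea behind index compression can be illustrated: within a row, sorted column indices tend to be close together, so small deltas need far fewer bits than absolute 32-bit indices. A toy sketch of that idea (not Maxeler's actual scheme):

```java
// Toy delta encoding of sorted column indices. Higher compression lets more
// of the matrix sit in fast on-chip RAM, which is what drives the speedup.
static int[] deltaEncode(int[] sortedColIdx) {
    int[] deltas = new int[sortedColIdx.length];
    int prev = 0;
    for (int i = 0; i < sortedColIdx.length; i++) {
        deltas[i] = sortedColIdx[i] - prev;   // small values, few bits each
        prev = sortedColIdx[i];
    }
    return deltas;
}
```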

Credit Derivatives Valuation & Risk (O. Mencer and S. Weston, 2010)

• Compute the value of complex financial derivatives (CDOs)
• Typically run overnight, but beneficial to compute in real time
• Many independent jobs
• Speedup: 220-270x
• Power consumption per node drops from 250W to 235W

[Figure: application analysis of the credit derivatives code.]

[Figure: DFE convolution architecture.]
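For context on what such a convolution architecture computes: CDO pricing builds a portfolio loss distribution by convolving the per-name loss contributions. Below is a plain-Java sketch of that standard recursion; the slide shows only the architecture, so this is illustrative and not Maxeler's implementation.

```java
// Portfolio loss distribution by repeated convolution: name i defaults with
// probability p[i], contributing l[i] units of loss. On return, dist[k] is
// the probability that the total portfolio loss equals k units.
static double[] lossDistribution(double[] p, int[] l, int maxLoss) {
    double[] dist = new double[maxLoss + 1];
    dist[0] = 1.0;                                   // start: zero loss
    for (int i = 0; i < p.length; i++) {
        for (int k = maxLoss; k >= 0; k--) {         // descending => in-place
            double survive = dist[k] * (1.0 - p[i]);
            double def = (k >= l[i]) ? dist[k - l[i]] * p[i] : 0.0;
            dist[k] = survive + def;
        }
    }
    return dist;
}
```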

Credit Derivatives Results

• Calculation of current value and credit spread risk for a population of 2,925 bespoke tranches
• Speedup from 1 MAX2 card:
– 219-270x compared to 1 core
– ~30x compared to an 8-core node
• Power consumption drops from 250W/node to 235W/node with acceleration

Summary & Conclusions

• Dataflow engines provide massive parallelism at low clock frequencies
• Many applications are amenable to dataflow processing, and can achieve high acceleration