dataflow and higher level abstractions for parallel …...dataflow programming q graph...

49
Dataflow and higher level abstractions for parallel programming Jeronimo Castrillon Chair for Compiler Construction (CCC) TU Dresden, Germany Cyber-Physical Systems Summer School 2019 Alghero, Sardinia, Italy. September 25, 2019

Upload: others

Post on 18-Sep-2020

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Dataflow and higher level abstractions for parallel programmingJeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany

Cyber-Physical Systems Summer School 2019Alghero, Sardinia, Italy. September 25, 2019

Page 2: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

2

Evolution of Hardware

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Transistors(thousands)

Single-threadPerformance

Frequency(MHz)

Typical power(W)

Core count

M. Horowitz, F. Labonte, et al. Dotted-line by C. Moore, “Data processing in exascale-class computer systems,” The Salishan Conference on High Speed Computing, 2011

Page 3: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Evolution of Hardware: Great complexity

3 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Transistors(thousands)

Single-threadPerformance

Frequency(MHz)

Typical power(W)

Core count

M. Horowitz, F. Labonte, et al. Dotted-line by C. Moore, “Data processing in exascale-class computer systems,” The Salishan Conference on High Speed Computing, 2011

q Heterogeneous many-cores, scalable platforms, complex memory hierarchies, domain-specific accelerators, …

Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999

1999

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

2019

Programmer

Page 4: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Models: Von Neumann

q Program: sequence of instructions in memory

4 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Central Processing Unit (CPU)

Memory & I/O

Control unit Processing unit

Inst. regProg. counter

Reg. fileALU

Good abstractions:q “High-level” languages (C, …)q Efficient HW implementation

Page 5: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Models: Von Neumann and Modern HW

q Different sources of parallelismq SIMD, threads, GPUs, …

q Fundamentally different resourcesq Dataflow accelerators, Tensor processing units, …

è Lots of effort in auto-parallelization and vectorization

5 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Sequential code

Parallel code(OpenMP, Pthreads,…)Compiler

Theorem (Allen/Kennedy): Any reordering transformation that preserves every dependence in a program preserves the meaning of that program

Page 6: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

6

Static/dynamic control/data analysis: Sample results

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

[DAC’08]

2008

[Aguilar’18]

2018Polybenchon real Jetson TX1

Experimental multi-core from Tokyo Institute of Technology (Prof. Isshiki)

Page 7: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Problems for auto-parallelizing compilers

1) Find all dependencies?

7 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

for (i = 1; i <= 100; i++)for (j = 1; j <= 100; j++) {

S1: X[i][j] = X[i][j] + Y[i-l][j];S2: Y[i][j] = Y[i][j] + X[i][j-1];

}

i

j

More often than not, impossible!

2) Coding style and the illusion of infinite shared memory

struct a{struct s *a;int *pi;struct t *p;float *p;…}

Successful example: Polyhedral Compilation

Page 8: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Problems for auto-parallelizing compilers (2)

1) Find all dependencies?

8 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

2) Coding style and the illusion of infinite shared memory

3) Dependencies can sometimes be violated!(they are artifacts of style)

[Edler’18]

Page 9: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Models: Von Neumann and Modern HW

q Different sources of parallelismq SIMD, threads, GPUs, …

q Fundamentally different resourcesq Dataflow accelerators, Tensor processing unit, …

q Von Neumann and sequential programmingq Difficult to automatically extract parallelism

q Difficult to recognize high-level operations

q Difficult to ensure correctness (functional, timing, …)

9 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 10: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Models and Programming

q Specially important for CPS!q Domain specific q Interconnected systems

q Interaction with physical world (Reactive)q Real-time (WCET, predictable HW)

q Dynamic changing conditionsq …

10

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

In this lectureq Classic dataflow modelingq Current work on extensions: dynamic, adaptable… (CPS)

q Raise-level of abstraction: Domain-specific languages

Page 11: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Dataflow programming

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 12: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Dataflow programming

q Graph representation of applications: Long historyq Implicit repetitive execution of tasksq Good model for streaming applications

q Good match for signal processing & multi-media

q The whyq Explicit parallelism

q Often: Determinism q Better analyzability (scheduling, mapping, optimization)

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);

PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);

switch (transTarget) {case TransM VP:

PnTransformTemplateInstantiate(D, S);

ErasePnProcessTemplates(D);

PnPrintForM VP(D, S);break;

case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);

break;case TransVPUtg:

PrintForVPUtg(D, S, streamFactory);

ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);

ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

}}

#include "PnTransform.h"#include "PnTVPUtg.h"

#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;

voidclang::PnTransform(TransTarget transTarget, bool traces,

const std::string & strM appingFileName, ASTContext

&Ctx,Sema &S, const

llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:

case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S,

streamFactory);ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);

ErasePnDefs(D);break;

case TransInvalid:assert(false);break;

&BasePath) {

assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);

break;default:

;}

12

Page 13: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Example: Synchronous dataflow graph (SDF)

13 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Def.: An SDF is an annotated multi-graph G=(V,E,W). V is the set of actors and E ⊆ VxV the set of channels. W = {w1,...,w|E|} ⊂N3 is a set of annotations for every channel e = (a1,a2), we=(pe,ce,de): pe is the number of tokens produced by a firing of a1, ce the tokens consumed by a2 and de the amount of delay tokens

Originally in [Lee’87]

Page 14: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Example: Synchronous dataflow graph (SDF)

q Example

14 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Def.: An SDF is an annotated multi-graph G=(V,E,W). V is the set of actors and E ⊆ VxV the set of channels. W = {w1,...,w|E|} ⊂N3 is a set of annotations for every channel e = (a1,a2), we=(pe,ce,de): pe is the number of tokens produced by a firing of a1, ce the tokens consumed by a2 and de the amount of delay tokens

What next?

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

Originally in [Lee’87]

Page 15: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Topology matrix

q Example

15 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Def.: Given an SDF G=(V,E,W), its topology matrix 𝚪 has a row for every channel and a column for every actor. [𝚪ik] = pik – cik (i.e. the amount of tokens produced minus those consumed by actor k on/from channel i)

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

� =

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA

Page 16: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Topology matrix (2)

q State of an SDF: tokens in the queuesq The topology matrix can be used to update the state after a firing

16 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

s1 = s0 +

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA ·

0

@100

1

A =

0

BB@

0002

1

CCA+

0

BB@

360�2

1

CCA =

0

BB@

3600

1

CCA

� =

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA

Page 17: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Topology matrix (3)

q State of an SDF: tokens in the queuesq The topology matrix can be used to update the state after a firing

17 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

s2 = s1 +

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA ·

0

@010

1

A =

0

BB@

3600

1

CCA+

0

BB@

�1�220

1

CCA =

0

BB@

2420

1

CCA

� =

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA

Page 18: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Topology matrix and complete cycles

q Complete cycle: sequence of firing that brings the SDF to the original state

q What for: Automatic unrolling to express parallelism (preserve semantics)

18 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

sn = s0 + � · (v1 + v2 + · · ·+ vn) = s0

� · (v1 + v2 + · · ·+ vn) = � · ~r = 0

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2 ~r = (1, 3, 2)T

a11 a22

a31

a21

a23

a32

Page 19: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Maximum cycle mean (MCM)

q Unroll SDF is an HSDFq The MCM relates to the maximum throughput of an HSDFq Given an HSDF G=(V,E)

q Let O(G) be the set of all cycles

q A cycle C=((vi,vi+1),(vi+2,vi+3),…,(vi+n-1, vi))

19 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

MCM(G) = maxC2O(G)

Pv:(u,v)_(v,u)2C

⇢(v)P

e2Cw(e)

!

x delay tokens

WCET

Page 20: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Other models

q Trade-off: Expressivity vs analyzability q Dynamics: Compromise on analyzability

20 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

[Stuijk‘11]

Page 21: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Dynamic models and language support

q Kahn Process Networks (KPNs)q Simple syntax for programmingq Less model information: more analysis effort

q Via tracing compilers canq Identify general communication patternsq Exploit event correlations

(“dependencies”)

21 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 22: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

22

Programming flow: Overview

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Dynamic Dataflow Application(CPN)

Non-functional specification

PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_rightPNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_leftPNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_rightPNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_srctaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_ltaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_rtaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sinktaskParams.priority = 1;

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Architecture model

Property models (timing, energy, errors, …)

Analysis

Synthesis

Code generation[Springer’14]

Page 23: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Programming flows

q Many: CAL, DOL, CPN, Daedalus, Ptolemy, CAPH, …

q Informationq Dataflow model: Rates, states, actions, traces, …

q Architecture model: Resources, interconnect, costs, …

q Optimization: Mapping application to hardware q Multi-objective optimizationq High-level simulation / cost models

23 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

LM

RISC

LM

RISC

LM

DSP

LM

DSP

LM

DSP

LM

DSP

ARM

SM

AHB Bus

LM: Local memorySM: Shared memory

qnt

src snk

qnt

qntdct

qnt

dctvle

dct

dct init

MJPEG application

Page 24: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Automatic mapping to heterogeneous many-cores

q Lots of research: Heuristics and meta-heuristicsq Sample results: Effort-quality trade-off

24 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

[MCSOC’16]

Multimedia apps on TI Keystone II

Page 25: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Dataflow and CPSsq Adaptivity, predictable, multi-app,

reactive, …

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 26: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

System dynamics: Hybrid mappings

q Applications not so static anymore!

q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant

q Run-time predictability, robustness & isolation

26 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

EUF Pareto front @ compile-time

Page 27: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Data-level parallelism: Scalable and adaptive

q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)q Similar to AdaPNet or parameterized SDFs

27 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

P1 P2 P3

Meta-information & configurations

Changes @ runtime

P1

P12

P3P22

P32

P1

P12

P3P32

P42

or

P22

[PARMA-DITAM’18]

[Schor’14, Desnos’13]

Page 28: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Flexible mappings: Generalized Tetris

q Given multiple canonical configs by compiler, select one at run-time

q Exploit mapping equivalences and similarities

q Related approach: Produce mapping constraints at compile-time

28 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration*

[ACM TACO’17, SCOPES’17b]

[Schwarzer’17]

Implemented as actual game for dissemination!

Page 29: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Sample symmetries for MJPEG

q Multimedia benchmarks on Adapteva (16 Epiphany cores)q Not always very intuitive

29 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

Best result Result “simple” Pruned (equiv. to “simple”) Pruned (equiv. to best)

[ACM TACO’17]

Page 30: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Flexible mappings: Run-time analysis

q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO

q 1x AF

q 4 x AF q 2 x AF + 2 x MIMO

q 3 mappings to two processors q T1: Best CPU time

q T2: Best wall-clock timeq T3: GBM heuristic

30 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

●●●

12

14

16

18

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

●●

●●

10

12

14

16

CFS Dyn T1 T2 T3

Ener

gy [J

]

●●●●

●●

4

5

6

7

8

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

Single AF

[Castrill12]

[SCOPES’17b]

Page 31: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Flexible mappings: Multi-application results (1)

31 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

●●

●● ●

●●

●●

●●

●●

6.5

7.0

7.5

8.0

8.5

9.0

9.5

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

●●

●●●●●●

●●●

●●●●●●●●●●●●●

●●●

●●●●●●

●●●●●●●

10

20

30

40

AF MIMO

CPU

time

[s]

●●●● ●●

●● ●●●●●●●

●●●

●●●●

●● ●●

10

20

30

AF MIMO

Wal

l−clo

ck ti

me

[s] More

predictable performance

Comparable performance to dynamic mapping

●●●●●●

●●●●

●●●

● ● ● ●

● ●

●●

●●

●●

● ● ● ●

12.5

15.0

17.5

20.0

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

[SCOPES’17b]

Page 32: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Robustness

q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control

q Application data, unexpected interrupts, unexpected OS decisions

è Can we reason about robustness of mapping to external factors?

32 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration* Perturb Modified behavior(correct?)

Page 33: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Design centering

q Design centering: Find a mapping that can better tolerate variations while staying feasible

q Studied field, in e.g., biology, circuit design or manufacturing systems.

q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping

33 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

optimum

design center in feasible region

x1

x2

feasible region

[SCOPES’17a]

Page 34: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Design centering: Algorithmic

q Intuition: Find the center and the form of a region, in which parameters deliver a correct solution

q Formally q A: Set of correct solutionsq P: Hitting probability

q L: Generic metric space

q Searching: Allow annealing (dynamically change P)

34 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

[SCOPES’17a]

Page 35: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Evaluation

q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)

35 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

MIMO-OFDM

[SCOPES’17a]

Page 36: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Ongoing work: Improve representations

q Work on embeddings: Architectures à Real numbers q Novel mapping representations: faster heuristics & more efficient heuristics?q Example: T-SNE Visualization for mappings space (8 tasks on Odroid XU4)

36 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

[MCSoC’18]

Page 37: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Domain-specific abstractions

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 38: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Higher-level Abstractions

q Dataflow: Nice abstraction across domains (maybe too overloaded)q Need to match domain-specific accelerators with domain-specific abstractions

38 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

https://www.hpcwire.com/2017/04/10/nvidia-responds-google-tpu-benchmarking/

var input A : matrixvar input u : tensorIN

v = (A # A # A # u . [[5 8] [3 7] [1 6]])

Domain-specific language

[RWDSL’18]

Page 39: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Example: CFDlang

q Origin: Spectral element methods in Computational Fluid Dynamics

q Simple expression language, formal semantics, domain expert optimizations

39 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

source = type matrix : [mp np] &type tensorIN : [np np np ne] &type tensorOUT : [mp mp mp me] &

&var input A : matrix &var input u : tensorIN &var input output v : tensorOUT &var input alpha : [] &var input beta : [] &

&v = alpha * (A # A # A # u .

[[5 8] [3 7] [1 6]]) + beta * v

Fortran embedding

[GPCE’17, RWDSL’18]

Page 40: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Tensor languages (hype): Machine learning

q High-level dataflow graph, low-level tensor operators

q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …

q Recent workq Cross-domain tensor transformations

q Formal semantics for safety checks(particularly important for CPS!)

q Optimization for emerging memories

40 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

[Chen’18]Example flow: TVM

[GPCE’18]

[Array’19]

[LCTES’19]

Page 41: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Correct by construction

q Similar to formal MoCsq Correct by designq No abstraction leaks

q Current efforts in formal semantics for safe code generation

41 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

[Array’19]

A = placeholder((m,h), name='A')

B = placeholder((n,h), name='B')

k = reduce_axis((0, h), name='k')

C = compute((m, n), lambda i, j:

sum(A[k, i] * B[k, j], axis=k))

𝐶%& = ()*+

,

𝐴)%𝐵)&

Page 42: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Emerging architectures

q Lots of research on tensor computations on tensor accelerators! q Recently looking into emerging memory architectures

q Hybrid architectures STT/Re-RAMq Racetrack memories

q Racetrack memories: Predicted extreme density at low latencyq Multiple bits stored in nano-wires with magnetic domainsq Non-volatile: Energy efficiency!

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 201942

e.g., [Kim’19]

[TVLSI’18, TVLSI’19]

Need to exploit sequentially on top of locality

[IEEE CAL’19]

Page 43: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Architecture and data layout optimization

q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other

optimizations

43[LCTES’19]

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 44: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Architecture and data layout optimization

q Data-layout: Reduce the number of shifts

44[LCTES’19]

n

DB

C:2n

DB

C:2n+1

DB

C:3n-1

R0

R1

Rn-1

C00 C01

n

R0

R1

Rn-1

DB

C:0

DB

C:1

DB

C:n-1

A00 A01 A0n-1

A10 A1n-1

An-1n-1

C0 Cn-1DBC:n DBC:n+1 DBC:2n-1

B00

B10

Bn-10

A00 A01 A0n-1

compulsory shifts

overhead shifts

B00 B10 Bn-10

compulsory shifts

overhead shifts

C1

B01

B11

Bn-11

C (Bank-2)Ã (Bank-0) ~B (Bank-1)

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 45: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Latency comparison vs SRAM

q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)

45

0

0.5

1

1.5

2

2.5

3

4 8 16 32 64 128 256 512 1024 2048 Avg

Nor

mal

ized

late

ncy

Tensor size

SRAM RTM-naïve RTM-opt RTM-opt-ps

[LCTES’19]

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 46: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Energy comparison vs SRAM

q Higher savings due to less leakage powerq 74% average improvement

46

0

0.2

0.4

0.6

0.8

1

1.2

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

4 8 16 32 64 128 256 512 1024 2048

Nor

mal

ized

ene

rgy

Tensor size

Leakage Energy Read/Write Energy Shift Energy

[LCTES’19]

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 47: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Summary

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Page 48: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

Summary

q Principled methodologies for programming are a must! – preaching to the choirq Advances in auto-parallelization (polyhedral, dynamic analysis)q Explicit parallel MoC-based programming models

q Quest for more adaptivity but retaining time predictability q Higher-level abstractions: Example for tensor-based computations for accelerators

q Lots of challenges remain (thankfully)q Cost models and characterization of trade-offs (vs. blind searches)q Understand impact of emerging technologies

q Syntax and semantics (for correctness): Lots of open questions q Time-semantics in programming and execution environments

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 201948

Page 49: Dataflow and higher level abstractions for parallel …...Dataflow programming q Graph representation of applications: Long history q Implicit repetitive execution of tasks q Good

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

References

[DAC’08] Ceng, J., et al. "MAPS: an integrated framework for MPSoC application parallelization.”, Proceedings of the 45th annual Design Automation Conference. DAC’08.[Aguilar’18] Aguilar, M. A., Software Parallelization and Distribution for Heterogeneous Multi-Core Embedded Systems. RWTH Aachen University, RWTH Aachen University, 2018, 181pp[Edler’18] T. J.K. Edler von Koch, S. Manilov, C. Vasiladiotis, M. Cole, and B.Franke. “Towards a Compiler Analysis for Parallel Algorithmic Skeletons”, CC’18.[Lee’87] E.A. Lee and D. G. Messerschmitt. "Synchronous data flow." Proceedings of the IEEE 1987[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [MCSoC’16] A. Goens, R. Khasanov, J. Castrillon, S. Polstra, A. Pimentel, "Why Comparing System-level MPSoC Mapping Approaches is Difficult: a Case Study" , MCSoC-16.[Schor’14] Schor, L.; Bacivarov, I.; Yang, H. & Thiele, L. “AdaPNet: Adapting Process Networks in Response to Resource Variations”, ESWEEK'14, 2014

[Stuijk’11] Stuijk, S.; Geilen, M.; Theelen, B. & Basten, T. “Scenario-aware dataflow: Modeling, analysis and implementation of dynamic applications”, SAMOS’11 404-411

[Desnos’13] Desnos, K., et al. "Pimm: Parameterized and interfaced dataflow meta-model for mpsocsruntime reconfiguration." SAMOS), 2013.

[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.

[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.

[DAC’12] J. Castrillon, A. Tretter, R. Leupers, G. Ascheid. “Communication-aware Mapping of KPN Applications Onto Heterogeneous MPSoCs” in DAC 2012, 1266-1271[Schwarzer’17] T. Schwarzer, et al. "Symmetry-eliminating design space exploration for hybrid application mapping on many-core architectures." IEEE TCAD 37.2 (2017): 297-310.

[MCSoC’18] A. Goens, C. Menard, J. Castrillon, "On the Representation of Mappings to Multicores", MCSoC-18, pp. 184–191, Hanoi, Vietnam, Sep 2018.

[CyPhy’19] M. Lohstroh, et al. "Reactors: A Deterministic Model for Composable Reactive Systems" Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) Oct 2019.

[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[GPCE’17] A. Susungi, N. A. Rink, J. Castrillon, et al. “Towards Compositional and Generative Tensor Optimizations”. GPCE’17, Oct. 2017, pp. 169–175.[Chen’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th OSDI, 2018.[GPCE’18] A. Susungi, N. A. Rink, A. Cohen, J. Castrillon, and C. Tadonki. “Meta-programming for cross-domain tensor optimizations”. GPCE’18, Nov. 2018, pp. 79–92.

[Array’19] N.A. Rink, N. A. and J. Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”, ARRAY, ACM, 2019, pp. 57-68

[Kim’19] J. Kim, et al. "A code generator for high-performance tensor contractions on GPUs." Proceedings of the Symposium on Code Generation and Optimization. IEEE Press, 2019.

[IEEE CAL’19] Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories”, In IEEE Computer Architecture Letters, Feb 2019.

[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES, Jun 2019, pp. 5-18

49