dataflow and higher level abstractions for parallel …...dataflow programming q graph...

Dataflow and higher level abstractions for parallel programmingJeronimo CastrillonChair for Compiler Construction (CCC)TU Dresden, Germany

Cyber-Physical Systems Summer School 2019Alghero, Sardinia, Italy. September 25, 2019

2

Evolution of Hardware

© Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Transistors(thousands)

Single-threadPerformance

Frequency(MHz)

Typical power(W)

Core count

M. Horowitz, F. Labonte, et al. Dotted-line by C. Moore, “Data processing in exascale-class computer systems,” The Salishan Conference on High Speed Computing, 2011

Evolution of Hardware: Great complexity

3 © Prof. J. Castrillon. CPS Summer School. Alghero, Italy, 2019

Transistors(thousands)

Single-threadPerformance

Frequency(MHz)

Typical power(W)

Core count

M. Horowitz, F. Labonte, et al. Dotted-line by C. Moore, “Data processing in exascale-class computer systems,” The Salishan Conference on High Speed Computing, 2011

q Heterogeneous many-cores, scalable platforms, complex memory hierarchies, domain-specific accelerators, …

Panda, P. R., Dutt, N. D., & Nicolau, A. Memory issues in embedded systems-on-chip: optimizations and exploration. Springer Science & Business Media. 1999

1999

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

2019

Programmer

Models: Von Neumann

q Program: sequence of instructions in memory


Central Processing Unit (CPU)

Memory & I/O

Control unit Processing unit

Inst. regProg. counter

Reg. fileALU

Good abstractions:q “High-level” languages (C, …)q Efficient HW implementation

Models: Von Neumann and Modern HW

q Different sources of parallelismq SIMD, threads, GPUs, …

q Fundamentally different resourcesq Dataflow accelerators, Tensor processing units, …

è Lots of effort in auto-parallelization and vectorization


Sequential code

Parallel code(OpenMP, Pthreads,…)Compiler

Theorem (Allen/Kennedy): Any reordering transformation that preserves every dependence in a program preserves the meaning of that program

6

Static/dynamic control/data analysis: Sample results


[DAC’08]

2008

[Aguilar’18]

2018Polybenchon real Jetson TX1

Experimental multi-core from Tokyo Institute of Technology (Prof. Isshiki)

Problems for auto-parallelizing compilers

1) Find all dependencies?


for (i = 1; i <= 100; i++)for (j = 1; j <= 100; j++) {

S1: X[i][j] = X[i][j] + Y[i-l][j];S2: Y[i][j] = Y[i][j] + X[i][j-1];

}

i

j

More often than not, impossible!

2) Coding style and the illusion of infinite shared memory

struct a{struct s *a;int *pi;struct t *p;float *p;…}

Successful example: Polyhedral Compilation

Problems for auto-parallelizing compilers (2)

1) Find all dependencies?


2) Coding style and the illusion of infinite shared memory

3) Dependencies can sometimes be violated!(they are artifacts of style)

[Edler’18]

Models: Von Neumann and Modern HW

q Different sources of parallelismq SIMD, threads, GPUs, …

q Fundamentally different resourcesq Dataflow accelerators, Tensor processing unit, …

q Von Neumann and sequential programmingq Difficult to automatically extract parallelism

q Difficult to recognize high-level operations

q Difficult to ensure correctness (functional, timing, …)



Models and Programming

q Specially important for CPS!q Domain specific q Interconnected systems

q Interaction with physical world (Reactive)q Real-time (WCET, predictable HW)

q Dynamic changing conditionsq …

10

Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

In this lectureq Classic dataflow modelingq Current work on extensions: dynamic, adaptable… (CPS)

q Raise-level of abstraction: Domain-specific languages

Dataflow programming


Dataflow programming

q Graph representation of applications: Long historyq Implicit repetitive execution of tasksq Good model for streaming applications

q Good match for signal processing & multi-media

q The whyq Explicit parallelism

q Often: Determinism q Better analyzability (scheduling, mapping, optimization)


PnTransformSdfToKpn(D, S);PnTransformToArrayAccess(D, S);CollectChannelAccessRanges(D, S);

PropagateChannelAccessRanges(D, S);PnStreamFactorystreamFactory(BasePath);

switch (transTarget) {case TransM VP:

PnTransformTemplateInstantiate(D, S);

ErasePnProcessTemplates(D);

PnPrintForM VP(D, S);break;

case TransPthread:PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);

break;case TransVPUtg:

PrintForVPUtg(D, S, streamFactory);

ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);


case TransInvalid:assert(false);break;

}}

#include "PnTransform.h"#include "PnTVPUtg.h"

#include "PnTVPUmap.h"#include "clang/AST/ASTContext.h"#include "PnStreamFactory.h"using namespace clang;

voidclang::PnTransform(TransTarget transTarget, bool traces,

const std::string & strM appingFileName, ASTContext

&Ctx,Sema &S, const

llvm::sys::Path &BasePath) {assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:

case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);break;

default:;

}

PnTransformPthreads(D, S, traces);ErasePnDefs(D);break;

case TransSystemC:PrintForSystemC(D, S, traces,

streamFactory);ErasePnDefs(D);break;

case TransVPUtg:PrintForVPUtg(D, S,

streamFactory);ErasePnDefs(D);break;

case TransVPUmap:PrintForVPUmap(D, S,

strM appingFileName, streamFactory);


case TransInvalid:assert(false);break;

&BasePath) {

assert(transTarget != TransInvalid);TranslationUnitDecl *D =

Ctx.getTranslationUnitDecl();PnTransformSizeof(D, S);PnTransformTask(D, S);switch (transTarget) {

case TransSystemC:case TransVPUtg:case TransVPUmap:

PnTCopying(D, S);

break;default:

;}

12

Example: Synchronous dataflow graph (SDF)


Def.: An SDF is an annotated multi-graph G=(V,E,W). V is the set of actors and E ⊆ VxV the set of channels. W = {w1,...,w|E|} ⊂N3 is a set of annotations for every channel e = (a1,a2), we=(pe,ce,de): pe is the number of tokens produced by a firing of a1, ce the tokens consumed by a2 and de the amount of delay tokens

Originally in [Lee’87]

Example: Synchronous dataflow graph (SDF)

q Example


Def.: An SDF is an annotated multi-graph G=(V,E,W). V is the set of actors and E ⊆ VxV the set of channels. W = {w1,...,w|E|} ⊂N3 is a set of annotations for every channel e = (a1,a2), we=(pe,ce,de): pe is the number of tokens produced by a firing of a1, ce the tokens consumed by a2 and de the amount of delay tokens

What next?

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

Originally in [Lee’87]

Topology matrix

q Example


Def.: Given an SDF G=(V,E,W), its topology matrix 𝚪 has a row for every channel and a column for every actor. [𝚪ik] = pik – cik (i.e. the amount of tokens produced minus those consumed by actor k on/from channel i)

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

� =

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA

Topology matrix (2)

q State of an SDF: tokens in the queuesq The topology matrix can be used to update the state after a firing


a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

s1 = s0 +

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA ·

0

@100

1

A =

0

BB@

0002

1

CCA+

0

BB@

360�2

1

CCA =

0

BB@

3600

1

CCA

� =

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA

Topology matrix (3)

q State of an SDF: tokens in the queuesq The topology matrix can be used to update the state after a firing


a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

s2 = s1 +

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA ·

0

@010

1

A =

0

BB@

3600

1

CCA+

0

BB@

�1�220

1

CCA =

0

BB@

2420

1

CCA

� =

0

BB@

3 �1 06 �2 00 2 �3�2 0 1

1

CCA

Topology matrix and complete cycles

q Complete cycle: sequence of firing that brings the SDF to the original state

q What for: Automatic unrolling to express parallelism (preserve semantics)


sn = s0 + � · (v1 + v2 + · · ·+ vn) = s0

� · (v1 + v2 + · · ·+ vn) = � · ~r = 0

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2 ~r = (1, 3, 2)T

a11 a22

a31

a21

a23

a32

Maximum cycle mean (MCM)

q Unroll SDF is an HSDFq The MCM relates to the maximum throughput of an HSDFq Given an HSDF G=(V,E)

q Let O(G) be the set of all cycles

q A cycle C=((vi,vi+1),(vi+2,vi+3),…,(vi+n-1, vi))


MCM(G) = maxC2O(G)

Pv:(u,v)_(v,u)2C

⇢(v)P

e2Cw(e)

!

x delay tokens

WCET

Other models

q Trade-off: Expressivity vs analyzability q Dynamics: Compromise on analyzability


[Stuijk‘11]

Dynamic models and language support

q Kahn Process Networks (KPNs)q Simple syntax for programmingq Less model information: more analysis effort

q Via tracing compilers canq Identify general communication patternsq Exploit event correlations

(“dependencies”)


22

Programming flow: Overview


Dynamic Dataflow Application(CPN)

Non-functional specification

PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_rightPNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_leftPNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_rightPNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_srctaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_ltaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_rtaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sinktaskParams.priority = 1;

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Architecture model

Property models (timing, energy, errors, …)

Analysis

Synthesis

Code generation[Springer’14]

Programming flows

q Many: CAL, DOL, CPN, Daedalus, Ptolemy, CAPH, …

q Informationq Dataflow model: Rates, states, actions, traces, …

q Architecture model: Resources, interconnect, costs, …

q Optimization: Mapping application to hardware q Multi-objective optimizationq High-level simulation / cost models


LM

RISC

LM

RISC

LM

DSP

LM

DSP

LM

DSP

LM

DSP

ARM

SM

AHB Bus

LM: Local memorySM: Shared memory

qnt

src snk

qnt

qntdct

qnt

dctvle

dct

dct init

MJPEG application

Automatic mapping to heterogeneous many-cores

q Lots of research: Heuristics and meta-heuristicsq Sample results: Effort-quality trade-off


[MCSOC’16]

Multimedia apps on TI Keystone II

Dataflow and CPSsq Adaptivity, predictable, multi-app,

reactive, …


System dynamics: Hybrid mappings

q Applications not so static anymore!

q Hybrid DSE: a compile and run-time approachq Enable adaptivity: malleable, multi-variant

q Run-time predictability, robustness & isolation


Smart health

Aerospace

AI, MLNLP

Smart transport

Infotainment system

Virtual 3D

Control systems

i/po/p

Robotics HCI

EUF Pareto front @ compile-time

Data-level parallelism: Scalable and adaptive

q Change parallelism from the application specificationq Static code analysis to identify possible transformations (or via annotations)q Implementation in FIFO library (semantics preserving)q Similar to AdaPNet or parameterized SDFs


P1 P2 P3

Meta-information & configurations

Changes @ runtime

P1

P12

P3P22

P32

P1

P12

P3P32

P42

or

P22

[PARMA-DITAM’18]

[Schor’14, Desnos’13]

Flexible mappings: Generalized Tetris

q Given multiple canonical configs by compiler, select one at run-time

q Exploit mapping equivalences and similarities

q Related approach: Produce mapping constraints at compile-time


Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration*

[ACM TACO’17, SCOPES’17b]

[Schwarzer’17]

Implemented as actual game for dissemination!

Sample symmetries for MJPEG

q Multimedia benchmarks on Adapteva (16 Epiphany cores)q Not always very intuitive


R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

R R

RR

R

R

R

R

Best result Result “simple” Pruned (equiv. to “simple”) Pruned (equiv. to best)

[ACM TACO’17]

Flexible mappings: Run-time analysis

q Linux kernel: symmetry-awareq Target: Odroid XU4 (big.LITTLE)q Multi-application scenarios: audio filter (AF) and MIMO

q 1x AF

q 4 x AF q 2 x AF + 2 x MIMO

q 3 mappings to two processors q T1: Best CPU time

q T2: Best wall-clock timeq T3: GBM heuristic


●●●

12

14

16

18

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

●

●

●

●●

●●

●

●

10

12

14

16

CFS Dyn T1 T2 T3

Ener

gy [J

]

●

●

●

●●●●

●●

4

5

6

7

8

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

Single AF

[Castrill12]

[SCOPES’17b]

Flexible mappings: Multi-application results (1)


●

●

●

●

●

●

●

●

●●

●

●

●

●

●

●● ●

●

●

●

●

●

●●

●

●●

●●

●

●

●●

6.5

7.0

7.5

8.0

8.5

9.0

9.5

CFS Dyn T1 T2 T3

Wal

l−cl

ock

time

[s]

●●

●●●●●●

●●●

●●●●●●●●●●●●●

●●●

●

●●●●●●

●●●●●●●

10

20

30

40

AF MIMO

CPU

time

[s]

●●●● ●●

●● ●●●●●●●

●●●

●●●●

●● ●●

10

20

30

AF MIMO

Wal

l−clo

ck ti

me

[s] More

predictable performance

Comparable performance to dynamic mapping

●

●

●

●

●●●●●●

●

●●●●

●

●●●

●

● ● ● ●

● ●

●●

●

●●

●

●●

● ● ● ●

12.5

15.0

17.5

20.0

CFS Dyn T1 T2 T3

CPU

tim

e [s

]

[SCOPES’17b]

Robustness

q Static mappings, transformed or not, provide good predictabilityq However: Many things out of control

q Application data, unexpected interrupts, unexpected OS decisions

è Can we reason about robustness of mapping to external factors?


Mapping 3 configuration1Mapping 2

configuration1Mapping configuration1

RuntimeMapping

configuration* Perturb Modified behavior(correct?)

Design centering

q Design centering: Find a mapping that can better tolerate variations while staying feasible

q Studied field, in e.g., biology, circuit design or manufacturing systems.

q Currentlyq Using a bio-inspired algorithmq Robust against OS changes to the mapping


optimum

design center in feasible region

x1

x2

feasible region

[SCOPES’17a]

Design centering: Algorithmic

q Intuition: Find the center and the form of a region, in which parameters deliver a correct solution

q Formally q A: Set of correct solutionsq P: Hitting probability

q L: Generic metric space

q Searching: Allow annealing (dynamically change P)


[SCOPES’17a]

Evaluation

q Analyze how robust the center really isq Perturbate mappings and check how often the constraints are missedq Signal processing applications on clustered ARM manycore and NoC manycore (16)


MIMO-OFDM

[SCOPES’17a]

Ongoing work: Improve representations

q Work on embeddings: Architectures à Real numbers q Novel mapping representations: faster heuristics & more efficient heuristics?q Example: T-SNE Visualization for mappings space (8 tasks on Odroid XU4)


[MCSoC’18]

Domain-specific abstractions


Higher-level Abstractions

q Dataflow: Nice abstraction across domains (maybe too overloaded)q Need to match domain-specific accelerators with domain-specific abstractions


https://www.hpcwire.com/2017/04/10/nvidia-responds-google-tpu-benchmarking/

var input A : matrixvar input u : tensorIN

v = (A # A # A # u . [[5 8] [3 7] [1 6]])

Domain-specific language

[RWDSL’18]

Example: CFDlang

q Origin: Spectral element methods in Computational Fluid Dynamics

q Simple expression language, formal semantics, domain expert optimizations


source = type matrix : [mp np] &type tensorIN : [np np np ne] &type tensorOUT : [mp mp mp me] &

&var input A : matrix &var input u : tensorIN &var input output v : tensorOUT &var input alpha : [] &var input beta : [] &

&v = alpha * (A # A # A # u .

[[5 8] [3 7] [1 6]]) + beta * v

Fortran embedding

[GPCE’17, RWDSL’18]

Tensor languages (hype): Machine learning

q High-level dataflow graph, low-level tensor operators

q Many existing frameworks, e.g., TVM, Tensor Comprehensions, TensorFlow, …

q Recent workq Cross-domain tensor transformations

q Formal semantics for safety checks(particularly important for CPS!)

q Optimization for emerging memories


[Chen’18]Example flow: TVM

[GPCE’18]

[Array’19]

[LCTES’19]

Correct by construction

q Similar to formal MoCsq Correct by designq No abstraction leaks

q Current efforts in formal semantics for safe code generation


[Array’19]

A = placeholder((m,h), name='A')

B = placeholder((n,h), name='B')

k = reduce_axis((0, h), name='k')

C = compute((m, n), lambda i, j:

sum(A[k, i] * B[k, j], axis=k))

𝐶%& = ()*+

,

𝐴)%𝐵)&

Emerging architectures

q Lots of research on tensor computations on tensor accelerators! q Recently looking into emerging memory architectures

q Hybrid architectures STT/Re-RAMq Racetrack memories

q Racetrack memories: Predicted extreme density at low latencyq Multiple bits stored in nano-wires with magnetic domainsq Non-volatile: Energy efficiency!


e.g., [Kim’19]

[TVLSI’18, TVLSI’19]

Need to exploit sequentially on top of locality

[IEEE CAL’19]

Architecture and data layout optimization

q Architecture – software co-optimizationq Embedded system for inference: RTM as scratchpad with pre-shifting and other

optimizations

43[LCTES’19]


Architecture and data layout optimization

q Data-layout: Reduce the number of shifts

44[LCTES’19]

n

DB

C:2n

DB

C:2n+1

DB

C:3n-1

R0

R1

Rn-1

C00 C01

n

R0

R1

Rn-1

DB

C:0

DB

C:1

DB

C:n-1

A00 A01 A0n-1

A10 A1n-1

An-1n-1

C0 Cn-1DBC:n DBC:n+1 DBC:2n-1

B00

B10

Bn-10

A00 A01 A0n-1

compulsory shifts

overhead shifts

B00 B10 Bn-10

compulsory shifts

overhead shifts

C1

B01

B11

Bn-11

C (Bank-2)Ã (Bank-0) ~B (Bank-1)


Latency comparison vs SRAM

q Un-optimized and naïve mapping: Even worse latency than SRAMq 24% average improvement (even with very conservative circuit simulation)

45

0

0.5

1

1.5

2

2.5

3

4 8 16 32 64 128 256 512 1024 2048 Avg

Nor

mal

ized

late

ncy

Tensor size

SRAM RTM-naïve RTM-opt RTM-opt-ps

[LCTES’19]


Energy comparison vs SRAM

q Higher savings due to less leakage powerq 74% average improvement

46

0

0.2

0.4

0.6

0.8

1

1.2

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

SRA

M

RTM

-naï

ve

RTM

-opt

RTM

-opt

-ps

4 8 16 32 64 128 256 512 1024 2048

Nor

mal

ized

ene

rgy

Tensor size

Leakage Energy Read/Write Energy Shift Energy

[LCTES’19]


Summary


Summary

q Principled methodologies for programming are a must! – preaching to the choirq Advances in auto-parallelization (polyhedral, dynamic analysis)q Explicit parallel MoC-based programming models

q Quest for more adaptivity but retaining time predictability q Higher-level abstractions: Example for tensor-based computations for accelerators

q Lots of challenges remain (thankfully)q Cost models and characterization of trade-offs (vs. blind searches)q Understand impact of emerging technologies

q Syntax and semantics (for correctness): Lots of open questions q Time-semantics in programming and execution environments



References

[DAC’08] Ceng, J., et al. "MAPS: an integrated framework for MPSoC application parallelization.”, Proceedings of the 45th annual Design Automation Conference. DAC’08.[Aguilar’18] Aguilar, M. A., Software Parallelization and Distribution for Heterogeneous Multi-Core Embedded Systems. RWTH Aachen University, RWTH Aachen University, 2018, 181pp[Edler’18] T. J.K. Edler von Koch, S. Manilov, C. Vasiladiotis, M. Cole, and B.Franke. “Towards a Compiler Analysis for Parallel Algorithmic Skeletons”, CC’18.[Lee’87] E.A. Lee and D. G. Messerschmitt. "Synchronous data flow." Proceedings of the IEEE 1987[Springer’14] J. Castrillon, R. Leupers, "Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap" , Springer, pp. 258, 2014. [MCSoC’16] A. Goens, R. Khasanov, J. Castrillon, S. Polstra, A. Pimentel, "Why Comparing System-level MPSoC Mapping Approaches is Difficult: a Case Study" , MCSoC-16.[Schor’14] Schor, L.; Bacivarov, I.; Yang, H. & Thiele, L. “AdaPNet: Adapting Process Networks in Response to Resource Variations”, ESWEEK'14, 2014

[Stuijk’11] Stuijk, S.; Geilen, M.; Theelen, B. & Basten, T. “Scenario-aware dataflow: Modeling, analysis and implementation of dynamic applications”, SAMOS’11 404-411

[Desnos’13] Desnos, K., et al. "Pimm: Parameterized and interfaced dataflow meta-model for mpsocsruntime reconfiguration." SAMOS), 2013.

[PARMA-DITAM’18] R. Khasanov, et al, "Implicit Data-Parallelism in Kahn Process Networks: Bridging the MacQueen Gap" , PARMA-DITAM 18, ACM, pp. 20–25, Jan 2018.

[ACM TACO’17] Goens, A. et al. “Symmetry in Software Synthesis”. In: ACM Transactions on Architecture and Code Optimization (TACO) (2017).[SCOPES’17a] G. Hempel, et al, "Robust Mapping of Process Networks to Many-Core Systems Using Bio-Inspired Design Centering" SCOPES’17.[SCOPES’17b] Goens, A. et al. “TETRiS: a Multi-Application Run-Time System for Predictable Execution of Static Mappings”, SCOPES’17.

[DAC’12] J. Castrillon, A. Tretter, R. Leupers, G. Ascheid. “Communication-aware Mapping of KPN Applications Onto Heterogeneous MPSoCs” in DAC 2012, 1266-1271[Schwarzer’17] T. Schwarzer, et al. "Symmetry-eliminating design space exploration for hybrid application mapping on many-core architectures." IEEE TCAD 37.2 (2017): 297-310.

[MCSoC’18] A. Goens, C. Menard, J. Castrillon, "On the Representation of Mappings to Multicores", MCSoC-18, pp. 184–191, Hanoi, Vietnam, Sep 2018.

[CyPhy’19] M. Lohstroh, et al. "Reactors: A Deterministic Model for Composable Reactive Systems" Workshop on Design, Modeling and Evaluation of Cyber Physical Systems (CyPhy 2019) Oct 2019.

[RWDSL’18] N. A. Rink, et al. “CFDlang: High-level code generation for high-order methods in fluid dynamics”. RWDSL’18.[GPCE’17] A. Susungi, N. A. Rink, J. Castrillon, et al. “Towards Compositional and Generative Tensor Optimizations”. GPCE’17, Oct. 2017, pp. 169–175.[Chen’18] Chen, Tianqi, et al. "TVM: An Automated End-to-End Optimizing Compiler for Deep Learning." 13th OSDI, 2018.[GPCE’18] A. Susungi, N. A. Rink, A. Cohen, J. Castrillon, and C. Tadonki. “Meta-programming for cross-domain tensor optimizations”. GPCE’18, Nov. 2018, pp. 79–92.

[Array’19] N.A. Rink, N. A. and J. Castrillon. “TeIL: a type-safe imperative Tensor Intermediate Language”, ARRAY, ACM, 2019, pp. 57-68

[Kim’19] J. Kim, et al. "A code generator for high-performance tensor contractions on GPUs." Proceedings of the Symposium on Code Generation and Optimization. IEEE Press, 2019.

[IEEE CAL’19] Asif Ali Khan, Fazal Hameed, Robin Bläsing, Stuart Parkin, Jeronimo Castrillon, "RTSim: A Cycle-accurate Simulator for Racetrack Memories”, In IEEE Computer Architecture Letters, Feb 2019.

[LCTES’19] Khan, A. A., Rink, N. A., Hameed, F., Castrillon, J. "Optimizing Tensor Contractions for Embedded Devices with Racetrack Memory Scratch-Pads”, Proceedings of LCTES, Jun 2019, pp. 5-18

49

dataflow and higher level abstractions for parallel …...dataflow programming q graph...

Documents