mapping stream programs onto heterogeneous multiprocessor systems [by barcelona supercomputing...

Mapping Stream Programs onto HeterogeneousMultiprocessor Systems

[by Barcelona Supercomputing Centre, Spain, Oct 09]

S. M. FarhadProgramming Language Group

School of Information TechnologyThe University of Sydney

1

Abstract

Presents a partitioning and allocation algorithm for stream compiler Targeting heterogeneous multiprocessors with constrained

distributed memory and communication topology Introduce a novel definition of connectedness

Which enables the algorithm to model the capabilities of the compiler

The algorithm uses convexity and connectedness constraints to produce partitions Easier to compile and require short pipelines

Shows StreamIt 2.1.1 benchmarks results for an SMP, 2 X 2 mesh, SMP plus accelerator and IMP QS20 blade which are within 5% of the optimum

2

Motivation

Recent trends: an increasing number of on-chip processor cores Intel IXP2850 [16], TI OMAP [7], Nexperia Home Platform [9],

and ST Nomadik Programs using explicit threads have to be tuned to match

the target The interesting thing is to automatically map a portable

program onto a heterogeneous multiprocessor Many multimedia and radio applications contain abundant

task and data parallelism But it is hard to extract from C code Stream languages represent the application as

independent actors communicating via point-to-point streams

3

Overview

The input to the partitioning algorithm is the program graph of kernels and streams

The output is the mapping to fuse kernels to tasks, and to allocate tasks to processors

The difference between kernels and tasks, Kernels: the work function of an actor Tasks: which are present in the executable, contain

multiple kernels and are scheduled on a known processor The algorithm requires

The average data rate on each stream, and (SDF: statically)

The average load of each kernel on each processor (profiling)

4

The ACOTES Stream Compiler This work is part of the ACOTES European

project Developing an open source stream compiler for

embedded systems Will automatically map a stream program onto a

multicore system Program will be written in SPM: an annotated C

programming language The target is represented using the ASM:

supports heterogeneous platforms

The Mapping Phase

Blockingand splitting

Partitioning

Software pipeliningand queue length

assignment

Initial Partitioning

Merge tasks

Move bottlenecks

Create tasks

Reallocate tasks

A polyhedral optimization pass unrolls to aggregate computation

and communication, andsplits stateless kernels for greater

parallelism

Partitioning algorithm described in this paper

Performs softwarepipelining and allocates memory for the stream buffers

Convex Connected Partitions

Convexity: a partition is convex if the graph of dependencies between tasks is acyclic

Equivalently, every directed path between two kernels in the same task is internal to that task

The convexity constraint is for Avoiding long software pipelines

Require long pipelines for a small increase in throughput (unaware of pipelining cost)

8

Quick Review: Software Pipelining

RectPolar

RectPolar

RectPolar

RectPolar

Prologue

New Steady

State

• New steady-state is free of dependencies

The number of pipeline stages

Convex Partition

Data flowis from processor p1 (black) to p2 (grey) and p3 (white),

and from p2 to p3—an acyclic graph

Connectedness

The connectedness constraint To help code

generation It is easier to fuse

adjacent kernels

If k2 and k3 are fused into one task, then the entire graph must be fused

Connectedness Contd.

A naïve definition: considers a partition to be connected when each processor has a weakly connected subgraph

Unfortunately, wide split-joins, as in filterbank, do not usually have good partitions subject to this constraint

Because strict connectedness Fig (c) performs 28% worse than Fig (a)

Generalize connectivity as a set of basic connected sets [Joseph’06]

Formalization of the Problem

The target is represented as an undirected bipartite graph H = (V, E), where

is the set of vertices, a disjoint union of processors, P, and interconnects, I and E is the set of edges

Processor weight: wp Clock speed in GHz Interconnect weight: wu is bandwidth in GB/s Static route between processor p and q: , if

it uses interconnect u, and 0 otherwise In general,

14

IPV

1upqr

uqp

upq rr

Topology of the Targets

15

Processorsa and b unable to

execute code; therefore

= wa = wb = 0

Formalization of the Problem Contd. The program is represented as a directed

acyclic graph, G = (K, S), where K is the set of kernels, and S is the set of

streams Load of kernel i on processor p, denoted cip,

is the mean number of gigacycles in some fixed time period tau

Load of stream ij, denoted cij is the mean number of gigabytes transferred in time tau

16

Formalization of the Problem Contd. The output of the algorithm is two map

functions Firstly, T maps kernels onto tasks, and Secondly, P maps tasks onto processors

The partition implied by T must be convex, so the graph of dependencies between tasks is acyclic

17

Formalization of the Problem Contd. The cost on processor p or interconnect u is

The goal is to find the allocation (T, P), which minimises the maximum values of all the Cp and Cu

Subject to the convexity and connectedness constraints

18

qp

p

KjKiu

ij

Pqp

upq

u

Kip

ipp

w

crC

w

cC

,,

Partitioning the Target

The algorithm First divides the target into two subgraphs, P1 and

P2, and An aggregate interconnect, I, balancing two

objectives: The subgraphs should have roughly equal total CPU

performance, and The aggregate interconnect bandwidth between them

should be low

19

Partitioning the Target Cond.

Communications bottleneck

The target is divided into halves to maximise alpha

20

21 ,max

PqPpu

uqp

upq

Iu w

rrC

21

,minPq

q

Pp

p wwC

They find an approximate solution using a variant of theKernighan and Lin partitioning algorithm

Partitioning the Program

The program graph is given edge and vertex weights

The edge weight for stream ij, denoted cij is the cost in cycles in time tau, if assigned to the aggregate interconnect

The vertex weight for kernel i is a pair (ciP1, ciP2), The goal is to find a two-way partition {K1,K2} to

minimize the bottleneck given by:

21

212

2

1

1

,

,,maxKjKiij

Kj

jP

Ki

iP cccc

Partitioning the Program Contd. The partitioning algorithm is a branch and bound

search Each node in the search tree inherits a partial

partition (K1, K2), and unassigned vertices X; at the root K1 = K2 = 0 and X = K

The minimal cost, cK1K2 , for all partitions in the subtree rooted at node (K1,K2)

22

2121

21

,

,,maxKjKiij

Ki

iB

Ki

iAKK ccclbc

1111

21

,

,,maxKjKiij

Ki

iB

Ki

iAKK cccubc

The Branch and Bound Search

23

qp

p

KjKiu

ij

Pqp

upq

u

Kip

ipp

w

crC

w

cC

,,

2121

21

,

,,maxKjKiij

Ki

iB

Ki

iAKK ccclbc

1111

21

,

,,maxKjKiij

Ki

iB

Ki

iAKK cccubc

Refinement of the Partition

The refinement stage starts with a valid initial partition Merge tasks Move bottlenecks Create tasks Reallocate tasks

24

Evaluation

25

Convergence of the Refinement Phase

26

Summary

A fast and robust partitioning algorithm for an iterative stream compiler

The algorithm maps an unstructured variable data rate stream program onto a heterogeneous multiprocessor system with any communications topology

The algorithm favours convex connected partitions, which do not require software pipelining and are easy to compile

The performance is, on average, within 5% of the optimum performance

27

Question?

Thank you

28

mapping stream programs onto heterogeneous multiprocessor systems [by barcelona supercomputing...

Documents

acotes stream compiler

stream buffers

actor tasks

mapping stream programs

program graph of kernels

stateless kernels

multiple kernels

open source stream compiler