Static Translation of Stream Programming to a Parallel System S. M. Farhad PhD Student Supervisor: Dr. Bernhard Scholz Programming Language Group School of Information Technology University of Sydney

Page 1: Static Translation of Stream Programming to a Parallel System

Static Translation of Stream Programming to a Parallel System

S. M. Farhad, PhD Student
Supervisor: Dr. Bernhard Scholz

Programming Language Group
School of Information Technology
University of Sydney

Page 2: Static Translation of Stream Programming to a Parallel System

Uniprocessor Performance

Page 3: Static Translation of Stream Programming to a Parallel System

Motivation

[Figure: number of cores per chip (1 to 512, doubling at each tick) plotted against year (1970 to 20??). Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045.]

Page 4: Static Translation of Stream Programming to a Parallel System

Motivation

For uniprocessors, C was:
• Portable
• High performance
• Composable
• Malleable
• Maintainable

Uniprocessors: C is the common machine language

[Figure: number of cores per chip (1 to 512) plotted against year (1970 to 20??), from the 4004 and the x86 line up to multicores such as Raw, Niagara, Cell, Cisco CSR-1, Picochip PC102 and Ambric AM2045.]

Page 5: Static Translation of Stream Programming to a Parallel System

Motivation

What is the common machine language for multicores?

[Figure: number of cores per chip (1 to 512) plotted against year (1970 to 20??), from the 4004 and the x86 line up to multicores such as Raw, Niagara, Cell, Cisco CSR-1, Picochip PC102 and Ambric AM2045.]

Page 6: Static Translation of Stream Programming to a Parallel System

Common Machine Languages

Uniprocessors:
• Common properties
– Single flow of control
– Single memory image
• Differences (and the compiler technique that hides each)
– Register file: register allocation
– ISA: instruction selection
– Functional units: instruction scheduling

Multicores:
• Common properties
– Multiple flows of control
– Multiple local memories
• Differences
– Number and capabilities of cores
– Communication model
– Synchronization model

Von Neumann languages represent the common properties and abstract away the differences.

A stream programming language is a common machine language for multicores.

Page 7: Static Translation of Stream Programming to a Parallel System

Properties of Stream Programs [W. Thies ‘02]

• A large (possibly infinite) amount of data
• Limited lifespan of each data item
• Little processing of each data item
• A regular, static computation pattern
• Stream program structure is relatively constant
• A lot of opportunities for compiler optimizations

Page 8: Static Translation of Stream Programming to a Parallel System

Applications of Stream Programming

Page 9: Static Translation of Stream Programming to a Parallel System

Model of Computation

• Synchronous Dataflow [Lee ‘92]
– Graph of autonomous filters
– Communicate via FIFO channels
• Static I/O rates [Edward ‘87]
– Compiler decides on an order of execution (schedule)
– Static estimation of computation

[Figure: example stream graph with filters AtoD, FMDemod, Scatter, LPF1, LPF2, LPF3, HPF1, HPF2, HPF3, Gather, Adder and Speaker.]
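Static I/O rates are what allow the compiler to fix a schedule ahead of time. As a minimal sketch, assuming a hypothetical three-filter pipeline with made-up push/pop rates (neither the filter names nor the rates come from the slides), the balance equations can be solved for the number of times each filter fires per steady state:

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(filters, channels):
    """Return the smallest integer firing counts that satisfy the SDF balance equations.

    channels is a list of (src, dst, push, pop): src pushes `push` items per firing
    and dst pops `pop` items per firing. For every channel we need
    reps[src] * push == reps[dst] * pop. Rates are assumed to be positive.
    """
    rate = {filters[0]: Fraction(1)}  # fix one filter, solve the rest relative to it
    pending = list(channels)
    while pending:
        progressed = False
        for ch in list(pending):
            src, dst, push, pop = ch
            if src in rate and dst in rate:
                if rate[src] * push != rate[dst] * pop:
                    raise ValueError("inconsistent rates: no static schedule exists")
            elif src in rate:
                rate[dst] = rate[src] * push / pop
            elif dst in rate:
                rate[src] = rate[dst] * pop / push
            else:
                continue  # neither end solved yet; retry on a later pass
            pending.remove(ch)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational solution to the smallest positive integers.
    denom = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {f: int(r * denom) for f, r in rate.items()}
    common = reduce(gcd, counts.values())
    return {f: c // common for f, c in counts.items()}


# Hypothetical pipeline A -> B -> C with made-up rates.
reps = repetition_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
print(reps)  # {'A': 3, 'B': 2, 'C': 1}
```

With these counts every channel is balanced: A pushes 3 × 2 = 6 items and B pops 2 × 3 = 6, so buffers do not grow from one steady state to the next.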

Page 10: Static Translation of Stream Programming to a Parallel System

StreamIt Language Overview [Thies ‘04]

• StreamIt is a novel language for streaming
– Exposes parallelism and communication
– Architecture independent
– Modular and composable
  • Simple structures composed to create complex graphs
– Malleable
  • Change program behavior with small modifications

[Figure: the StreamIt constructs filter, pipeline, splitjoin (splitter, parallel computation, joiner) and feedback loop (joiner, splitter); each component may be any StreamIt language construct.]
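The hierarchical composition can be illustrated outside StreamIt. The following is a rough Python model, not StreamIt syntax: the helper names `make_filter`, `pipeline` and `splitjoin` are hypothetical, the splitter is assumed to duplicate its input, the joiner is assumed to be round-robin, and the feedback loop is left out.

```python
def make_filter(fn):
    """A filter that pops one item and pushes one item per firing."""
    return lambda items: [fn(x) for x in items]


def pipeline(*stages):
    """Run stages one after another; each stage may itself be any construct."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin(*branches):
    """Duplicate splitter feeding every branch, then a round-robin joiner."""
    def run(items):
        outputs = [branch(items) for branch in branches]
        # Round-robin join: take one item from each branch in turn.
        return [x for group in zip(*outputs) for x in group]
    return run


# A pipeline whose second stage is a splitjoin of two filters.
program = pipeline(
    make_filter(lambda x: x + 1),
    splitjoin(make_filter(lambda x: x * 2), make_filter(lambda x: -x)),
)
print(program([1, 2]))  # [4, -2, 6, -3]
```

Because every construct has the same shape (a list of items in, a list of items out), a pipeline stage or a splitjoin branch can be any other construct, which is the composability the slide refers to.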

Page 11: Static Translation of Stream Programming to a Parallel System

Mapping of Filters to Multicores

• Task Parallelism [Edward ‘87]
• Fine-Grained Data Parallelism [Michael ‘06]
• 3-phase solution [Michael ‘06]
• Orchestrating the Execution of Stream Programs [Kudlur ‘08]

Page 12: Static Translation of Stream Programming to a Parallel System

Baseline 1: Task Parallelism

[Figure: stream graph with a Splitter, two parallel pipelines (each containing BandPass, Compress, Process, Expand and BandStop filters), a Joiner and an Adder.]

• Inherent task parallelism between two processing pipelines
• Task Parallel Model:
– Only parallelize explicit task parallelism
– Fork/join parallelism
• Executing this on a 2-core machine gives ~2x speedup over a single core

Page 13: Static Translation of Stream Programming to a Parallel System

Baseline 2: Fine-Grained Data Parallelism

• Each of the filters in the example is stateless
• Fine-grained Data Parallel Model:
– Fiss each stateless filter N ways (N is the number of cores)
– Remove scatter/gather if possible
• We can introduce data parallelism
– Example: 4 cores
• Each fission group occupies the entire machine

[Figure: the example graph after fission, with each BandPass, Compress, Process, Expand and BandStop filter replicated between its own Splitter and Joiner.]
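Fission of a stateless filter can be sketched in a few lines. This is an illustrative model only: the function `fiss` is hypothetical, it assumes a filter that pops one item and pushes one item per firing (no peeking), and it runs the copies sequentially where real hardware would place them on separate cores.

```python
def fiss(stateless_filter, items, n_ways):
    """Run n_ways copies of a stateless filter behind a round-robin splitter and joiner."""
    # Splitter: item i goes to copy i % n_ways.
    lanes = [items[k::n_ways] for k in range(n_ways)]
    # Each copy processes its own lane independently of the others.
    results = [[stateless_filter(x) for x in lane] for lane in lanes]
    # Joiner: read the lanes back in the same round-robin order.
    return [results[i % n_ways][i // n_ways] for i in range(len(items))]


def scale(x):
    """Made-up stateless filter: the output depends only on the current input."""
    return 2 * x + 1


data = list(range(10))
# Fissing 4 ways produces exactly the same stream as the single filter.
print(fiss(scale, data, 4) == [scale(x) for x in data])  # True
```

The equality holds only because the filter keeps no state between firings; a filter that carried state from one item to the next could not be split this way without changing its output.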

Page 14: Static Translation of Stream Programming to a Parallel System

3-Phase Solution [Michael ‘06]

Target a 4 core machine

[Figure: stream graph with filters RectPolar, AdaptDFT (two copies), Amplify, Diff, Unwrap and Accum (two copies each) and PolarRect, connected through Splitters and Joiners. Work estimates shown on the filters: 66, 20, 2, 1, 1, 1, 2, 1, 1, 1, 20. Annotations on the graph read "Data Parallel" and "Data Parallel, but too little work!".]

Page 15: Static Translation of Stream Programming to a Parallel System

Data Parallelize

Target a 4 core machine

[Figure: the stream graph with RectPolar and PolarRect replicated between Splitters and Joiners; AdaptDFT, Amplify, Diff, Unwrap and Accum are unchanged. Work estimates shown: 66, 20, 2, 1, 1, 1, 2, 1, 1, 1, 20, 5, 5.]

Page 16: Static Translation of Stream Programming to a Parallel System

Data + Task Parallel Execution

Target 4 core machine

[Figure: execution timeline (Cores versus Time) for the data-parallelized stream graph, with 21 marked on the time axis. Work estimates shown on the graph: 66, 2, 1, 1, 1, 2, 1, 1, 1, 5, 5.]

Page 17: Static Translation of Stream Programming to a Parallel System

Better Mapping

Target 4 core machine

[Figure: execution timeline (Cores versus Time) for the same stream graph under a better mapping, with 16 marked on the time axis. Work estimates shown on the graph: 66, 2, 1, 1, 1, 2, 1, 1, 1, 5, 5.]

Page 18: Static Translation of Stream Programming to a Parallel System

Phase 3: Coarse-Grained Software Pipelining

[Figure: successive executions of the stream graph (RectPolar shown four times), divided into a Prologue and a New Steady State.]

• New steady-state is free of dependencies
• Schedule new steady-state using a greedy partitioning

Page 19: Static Translation of Stream Programming to a Parallel System

Greedy Partitioning [Michael ‘06]

Target 4 core machine

[Figure: filters from a "To Schedule" list are placed on a Cores versus Time chart, with 16 marked on the time axis.]
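The slides do not spell out the exact greedy rule, so the following is only a sketch of one common heuristic (largest work first, always onto the least-loaded core). The filter names and work estimates are hypothetical, and the sketch relies on the steady state being free of dependencies so that filters can be placed in any order.

```python
import heapq


def greedy_partition(work, n_cores):
    """Assign filters to cores, heaviest first, always choosing the least-loaded core."""
    heap = [(0, core) for core in range(n_cores)]  # (current load, core id)
    heapq.heapify(heap)
    assignment = {core: [] for core in range(n_cores)}
    # Sort by descending work; ties are broken by name so the result is deterministic.
    for name, cost in sorted(work.items(), key=lambda kv: (-kv[1], kv[0])):
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + cost, core))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


# Hypothetical work estimates for five independent filters on two cores.
assignment, makespan = greedy_partition({"A": 5, "B": 4, "C": 3, "D": 3, "E": 1}, 2)
print(assignment)  # {0: ['A', 'D'], 1: ['B', 'C', 'E']}
print(makespan)    # 8
```

In this example the total work is 16, so a makespan of 8 on two cores is a perfect balance; in general the heuristic gives a good but not necessarily optimal partition.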

Page 20: Static Translation of Stream Programming to a Parallel System

Static Translation of Stream Programs [Proposal]

• We study
– A mathematical model and algorithms to resolve bottlenecks in stream programs
– Mapping the actors of stream programs to processors in a parallel system
– Computing a schedule for each processor
• The goal is to statically optimize the throughput of a stream program
• Assuming constant input bandwidth

Page 21: Static Translation of Stream Programming to a Parallel System

Research Question: Removing the bottleneck from the stream graph

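[Figure, original stream graph: filters A, B, C and D. Filter B is the bottleneck.]

[Figure, after removing the bottleneck: filter B is duplicated, with the two copies of B placed between a splitter (S) and a joiner (J); A, C and D remain.]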

Page 22: Static Translation of Stream Programming to a Parallel System

Research Method

• Perform a quantitative analysis that detects bottlenecks in the stream graph
• The bottleneck resolver duplicates the actors that impose a bottleneck
• The process continues until the program is bottleneck-free
• The actors are then mapped to processors via Integer Linear Programming
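The detect-and-duplicate loop can be sketched as below. The slides do not define the bottleneck criterion, so this sketch makes loud assumptions: an actor counts as a bottleneck when its work per copy exceeds the total work divided by the number of processors, every actor is stateless, and duplication splits work evenly with no splitter/joiner overhead. The function name and the work figures are hypothetical, and the Integer Linear Programming mapping step is not shown.

```python
def resolve_bottlenecks(work, n_procs):
    """Duplicate the heaviest actor until no actor exceeds the per-processor budget.

    ASSUMPTION (not from the slides): the budget is total work / n_procs, and
    k copies of an actor each carry work / k.
    """
    budget = sum(work.values()) / n_procs
    copies = {actor: 1 for actor in work}
    while True:
        # Quantitative analysis: find the actor with the most work per copy.
        heaviest = max(work, key=lambda a: work[a] / copies[a])
        if work[heaviest] / copies[heaviest] <= budget:
            return copies  # bottleneck-free under the assumed criterion
        copies[heaviest] += 1  # the resolver duplicates the bottleneck actor


# Hypothetical work estimates for a graph with actors A, B, C, D on 4 processors.
print(resolve_bottlenecks({"A": 2, "B": 8, "C": 4, "D": 2}, 4))
# {'A': 1, 'B': 2, 'C': 1, 'D': 1}
```

Here the budget is 16 / 4 = 4, so B (work 8) is duplicated once and the loop stops; the resulting copy counts would then be the input to the mapping step.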

Page 23: Static Translation of Stream Programming to a Parallel System

Plan

• Background study

• Research question

• Proposal

• Implementation

• Results

• Publication

Page 24: Static Translation of Stream Programming to a Parallel System

Questions?