1

Compiling with multicore

Jeehyung Lee

15-745 Spring 2009

2

Papers

Automatic Thread Extraction with Decoupled Software Pipelining: fully automatic, fine-grained pipelining

A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs: semi-automatic, coarse-grained pipelining

3

First paper

Automatic Thread Extraction with Decoupled Software Pipelining

Guilherme Ottoni, Ram Rangan, Adam Stoler, and David August

From Princeton University

4

What is the paper about?

Despite the increasing use of multiprocessors, many single-threaded applications do not benefit

Let the compiler automatically extract threads and exploit lurking pipeline parallelism

Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining (DSWP)

5

Why decoupled pipelining?

Example

Linked list traversal

6


DOACROSS

Iteration * (LD latency + communication latency)

7


DSWP

Iteration * LD latency

One way pipelining
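
The two critical-path formulas on the DOACROSS and DSWP slides can be compared with a small worked calculation. This is only a sketch: the function names and the latency values (3 cycles per load, 10 cycles per inter-core hop) are hypothetical numbers chosen for illustration, not figures from the paper.

```python
def doacross_cycles(iterations, ld_latency, comm_latency):
    # DOACROSS: each iteration's pointer load must wait for the previous
    # core to forward the pointer, so communication sits on the critical path.
    return iterations * (ld_latency + comm_latency)


def dswp_cycles(iterations, ld_latency):
    # DSWP: the pointer-chasing recurrence stays on one core, so only the
    # load latency is paid per iteration (as stated on the slide).
    return iterations * ld_latency


# Hypothetical latencies, for illustration only.
ITERATIONS, LD, COMM = 1000, 3, 10
print("DOACROSS:", doacross_cycles(ITERATIONS, LD, COMM))
print("DSWP:    ", dswp_cycles(ITERATIONS, LD))
```

With these assumed numbers, the communication latency dominates the DOACROSS critical path, which is the motivation for keeping data flow one-way.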

8

DSWP

Flow of data (dependency) is acyclic among cores

With use of an inter-core queue, threads can be decoupled

Efficiency + high tolerance for latency
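
The decoupling idea can be sketched with two threads and a bounded queue, using the linked list traversal from the earlier example. Everything here (Node, traverse, work, run_pipeline, the queue size of 32, squaring as the "work") is a hypothetical illustration, not the paper's generated code. Python threads do not give real multicore speedup; the sketch only shows the structure: data flows one way, and the consumer never sends anything back.

```python
import queue
import threading


class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


SENTINEL = object()  # marks the end of the stream


def traverse(head, q):
    # Stage 1: pointer chasing. The loop-carried recurrence stays on one thread.
    node = head
    while node is not None:
        q.put(node.value)   # blocks only if the queue is full
        node = node.next
    q.put(SENTINEL)


def work(q, results):
    # Stage 2: per-node work. It only consumes from the queue.
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * item)


def run_pipeline(values):
    # Build a linked list from the given values.
    head = None
    for v in reversed(values):
        head = Node(v, head)
    q = queue.Queue(maxsize=32)  # the bounded queue buffers latency between stages
    results = []
    producer = threading.Thread(target=traverse, args=(head, q))
    consumer = threading.Thread(target=work, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

Because the queue absorbs short stalls on either side, a delay in one stage does not immediately stall the other, which is the latency tolerance the slide refers to.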

9

DSWP Algorithm

Build dependence graph

Find strongly connected components (SCC)

Create DAG of SCC

Partition DAG

Split codes into partitions

Add flows to partitions

10

Build dependence graph

Include every traditional dependence (data, control, and memory) & extensions

11

Find SCC

SCC : Instructions that form a dependency cycle in a loop

Instructions in SCC cannot be parallelized

12

Create DAG of SCCs

Merge instructions within each SCC and update dependency arrows
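
The "Find SCC" and "Create DAG of SCCs" steps can be sketched as follows. This uses Tarjan's algorithm, a standard SCC algorithm; the slides do not say which algorithm the paper uses. The function names and the four-instruction dependence graph in the usage note are hypothetical.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps a node to a list of its successors."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = [0]

    def visit(v):
        index[v] = lowlink[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop the stack down to v.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(frozenset(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


def condense(graph):
    """Merge each SCC into one node; keep only the edges between different SCCs."""
    sccs = strongly_connected_components(graph)
    owner = {v: c for c in sccs for v in c}
    dag = {c: set() for c in sccs}
    for v, succs in graph.items():
        for w in succs:
            if owner[v] != owner[w]:
                dag[owner[v]].add(owner[w])
    return dag


# Hypothetical dependence graph for a list-traversal loop.
example = {
    "load_next": ["test_null", "load_val"],
    "test_null": ["load_next", "load_val", "add"],
    "load_val": ["add"],
    "add": ["add"],  # accumulator depends on itself across iterations
}
```

Calling condense(example) merges load_next and test_null into a single node (they form a dependence cycle) and leaves load_val and add as their own nodes, with edges only pointing forward.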

13

Partition DAG

Partition DAG nodes into n partitions (n <= # of processors)

Use a heuristic to maximize load balance

Decide # of partitions (threads)

Start filling in from partition 1 with nodes from the top of the DAG. When the partition is full (estimated by # of cycles), move on to the next partition

Find the best # of threads and its partition
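
A minimal sketch of the greedy fill described above, assuming the DAG nodes are already listed in topological order with an estimated cycle count each. The function name, the "full" threshold of total cycles divided by n, and the example costs are assumptions for illustration; the paper's actual heuristic is not reproduced here.

```python
def greedy_partition(nodes, n):
    """nodes: list of (name, estimated_cycles) in topological order of the DAG.

    Fills partition 1 first and moves on once adding the next node would
    exceed the per-partition target (an assumed threshold: total / n).
    """
    total = sum(cycles for _, cycles in nodes)
    target = total / n
    partitions = [[]]
    load = 0
    for name, cycles in nodes:
        overfull = load + cycles > target
        if partitions[-1] and overfull and len(partitions) < n:
            partitions.append([])  # current partition is full, start the next
            load = 0
        partitions[-1].append(name)
        load += cycles
    return partitions


# Hypothetical SCC nodes and cycle estimates.
nodes = [("A", 3), ("B", 3), ("C", 2), ("D", 4)]
print(greedy_partition(nodes, 2))
```

Filling in topological order means every dependence points from an earlier partition to a later one, so the resulting pipeline stays one-way.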

14

Split code and insert flows (done!)

For each partition, insert the basic blocks relevant to the SCC nodes it contains

Add code for the dependence flows

15

Result

19.4% speedup on important benchmark loops, 9.2% overall

When core bandwidth is halved

Single-threaded code slows down by 17.1%

DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core

Promising enabler for Thread-Level Parallelism (TLP)?

16

Second Paper

A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs

William Thies, Vikram Chandrasekhar, and Saman Amarasinghe

From MIT

17


Despite increasing uses of multiprocessors, many single threaded… (Repeated)

Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code

Let people define the pipeline, and learn practical dependencies at runtime

18


Despite increasing uses of multiprocessors, many single threaded… (Repeated)

Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code

Let people define stages, and learn practical dependencies at runtime

…for streaming applications

19

Interface

Add annotations in the body of the top-level loop

20

Dynamic analysis

The system creates a stream graph according to annotations.

How do they find dependencies?

21

Dynamic analysis

Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages

22

Dynamic analysis

Run the application on training examples, and record every relevant store-load pair across pipeline boundaries

This gives us practical dependencies
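
The recording step can be sketched as a traced memory that remembers which stage last stored to each address and logs a dependence whenever a different stage loads it. This is a simplified illustration: TracedMemory, training_run, the stage names, and the string addresses are all hypothetical, and a real tool would instrument actual loads and stores and also separate iterations.

```python
class TracedMemory:
    def __init__(self):
        self.values = {}
        self.last_writer = {}      # address -> stage that last stored to it
        self.dependences = set()   # (producer_stage, consumer_stage, address)
        self.stage = None

    def enter_stage(self, name):
        # Stands in for the user's pipeline-boundary annotation.
        self.stage = name

    def store(self, addr, value):
        self.values[addr] = value
        self.last_writer[addr] = self.stage

    def load(self, addr):
        writer = self.last_writer.get(addr)
        # Only store-load pairs that cross a stage boundary are recorded.
        if writer is not None and writer != self.stage:
            self.dependences.add((writer, self.stage, addr))
        return self.values.get(addr)


def training_run(mem, inputs):
    # Hypothetical three-stage streaming loop run on training inputs.
    for x in inputs:
        mem.enter_stage("read")
        mem.store("buf", x)
        mem.enter_stage("filter")
        mem.store("out", mem.load("buf") * 2)
        mem.enter_stage("write")
        mem.load("out")
```

The recorded set only reflects what the training inputs exercised, which is why the resulting dependences are "practical" rather than guaranteed.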

23

Interface

Program shows a complete stream graph

User decides if he/she likes this pipelining or not

• If yes, done!

• Else, redo annotations. Iterate until satisfied

24

Actual pipelining

When compiled, annotation macros emit code that forks the original program for each pipeline stage

25

Result

Average 2.78x speedup, max 3.89x on 4 cores

Seems unsound but practical (?)
