
Compiling with multicore

Jeehyung Lee

15-745 Spring 2009


Papers

Automatic Thread Extraction with Decoupled Software Pipelining
• Fully automatic
• Fine-grained pipelining

A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
• Semi-automatic
• Coarse-grained pipelining


First paper

Automatic Thread Extraction with Decoupled Software Pipelining

Guilherme Ottoni, Ram Rangan, Adam Stoler and David August

From Princeton University


What is the paper about?

Despite the increasing use of multiprocessors, many single-threaded applications do not benefit

Let the compiler automatically extract threads and exploit lurking pipeline parallelism
• Extract non-speculative and truly decoupled threads through Decoupled Software Pipelining (DSWP)


Why decoupled pipelining?

Example

Linked list traversal


Why decoupled pipelining?

DOACROSS

Iterations * (LD latency + communication latency)


Why decoupled pipelining?

DSWP

Iterations * LD latency

One-way pipelining
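
The DOACROSS and DSWP cost expressions on these slides can be compared with a toy calculation. This is only a sketch: the function names and the sample latencies are illustrative assumptions, not numbers from the paper.

```python
def doacross_cycles(iterations, ld_latency, comm_latency):
    # DOACROSS: the pointer-chasing load and the inter-core hop are both
    # on the critical path of every iteration.
    return iterations * (ld_latency + comm_latency)


def dswp_cycles(iterations, ld_latency):
    # DSWP: data flows one way only, so the slide's estimate charges just
    # the load latency per iteration.
    return iterations * ld_latency


# Made-up values, chosen only to show the gap between the two formulas.
n, ld, comm = 1000, 3, 10
print(doacross_cycles(n, ld, comm))  # 13000
print(dswp_cycles(n, ld))            # 3000
```

The larger the communication latency is relative to the load latency, the larger the gap between the two estimates.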


DSWP

Flow of data (dependency) is acyclic among cores

With the use of an inter-core queue, threads can be decoupled
• Efficiency + high tolerance for latency
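
A rough picture of this decoupling, using the linked list traversal from the earlier example: one thread chases pointers and the other does the per-node work, connected by a one-way queue. This is a sketch under assumptions, not the paper's implementation: `queue.Queue` stands in for the inter-core queue, and `Node`, `dswp_traverse`, and the stage names are hypothetical. Python threads share an interpreter lock, so this shows the structure only, not a real speedup.

```python
import queue
import threading


class Node:
    def __init__(self, val, next=None):
        self.val = val
        self.next = next


def dswp_traverse(head, work):
    q = queue.Queue()   # stands in for the inter-core queue
    done = object()     # sentinel marking the end of the list
    results = []

    def traversal_stage():
        # Stage 1: only follows pointers; never waits on stage 2.
        node = head
        while node is not None:
            q.put(node)
            node = node.next
        q.put(done)

    def work_stage():
        # Stage 2: consumes nodes in order and does the per-node work.
        while True:
            node = q.get()
            if node is done:
                break
            results.append(work(node.val))

    threads = [threading.Thread(target=traversal_stage),
               threading.Thread(target=work_stage)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


head = Node(1, Node(2, Node(3)))
print(dswp_traverse(head, lambda v: v * 10))  # [10, 20, 30]
```

Because data only ever flows from the traversal stage to the work stage, the first thread can run ahead and the queue absorbs variations in latency.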


DSWP Algorithm

• Build dependence graph
• Find strongly connected components (SCC)
• Create DAG of SCC
• Partition DAG
• Split codes into partitions
• Add flows to partitions


Build dependence graph

Include every traditional dependence (data, control, and memory) & extensions
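
As a small illustration of one ingredient of such a graph, the sketch below derives register data dependences (including loop-carried ones) for a single-block loop body. It is a simplification under stated assumptions: control and memory dependences and the paper's extensions are left out, and the instruction labels `A`, `C`, `D`, `E` and the `(name, defs, uses)` encoding are hypothetical.

```python
def data_dependences(instrs):
    """instrs: list of (name, defs, uses) for one straight-line loop body."""
    edges = set()
    n = len(instrs)
    for i, (name, defs, _) in enumerate(instrs):
        for var in defs:
            # Walk forward through the body, wrapping around the back edge,
            # until the variable is redefined.
            for step in range(1, n + 1):
                other, other_defs, other_uses = instrs[(i + step) % n]
                if var in other_uses:
                    edges.add((name, other))
                if var in other_defs:
                    break
    return edges


# Hypothetical loop body: while (p = p->next) p->val = p->val + 1;
loop = [
    ("A", {"p"}, {"p"}),        # p = p->next
    ("C", {"t"}, {"p"}),        # t = p->val
    ("D", {"t"}, {"t"}),        # t = t + 1
    ("E", set(), {"p", "t"}),   # p->val = t
]
print(sorted(data_dependences(loop)))
```

The self-edge on `A` is the loop-carried dependence through `p`, which is what makes the pointer chase sequential.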


Find SCC

SCC: instructions that form a dependence cycle in the loop

Instructions in the same SCC cannot be split across threads
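
SCCs can be found with any standard algorithm; the sketch below uses Tarjan's algorithm, which is a choice made here and not necessarily the one used in the paper. The example graph and its instruction labels are hypothetical: `A` is the pointer chase, `B` the loop branch, and `C`, `D`, `E` the load, add, and store of the loop body.

```python
def strongly_connected_components(graph):
    """Tarjan's algorithm. graph maps each node to a list of successors."""
    index, lowlink = {}, {}
    stack, on_stack = [], set()
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index[w])
        if lowlink[v] == index[v]:
            # v is the root of a component: pop it off the stack.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(sorted(component))

    for v in graph:
        if v not in index:
            visit(v)
    return sccs


# Hypothetical dependence graph for a linked list update loop.
graph = {
    "A": ["A", "B", "C", "E"],  # p = p->next (loop-carried through p)
    "B": ["A", "C", "D", "E"],  # loop branch: controls every instruction
    "C": ["D"],                 # t = p->val
    "D": ["E"],                 # t = t + 1
    "E": [],                    # p->val = t
}
print(sorted(strongly_connected_components(graph)))
# [['A', 'B'], ['C'], ['D'], ['E']]
```

Here the pointer chase and the branch form one cycle and must stay together, while the three body instructions are each their own component.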


Create DAG of SCCs

Merge instructions within each SCC and update dependency arrows
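
A minimal sketch of this merging step, assuming the SCCs have already been computed: every edge inside a component disappears and every edge between components becomes a DAG edge. The graph, the component list, and the function name `condense` are hypothetical.

```python
def condense(graph, sccs):
    # Map every instruction to the index of the SCC that contains it.
    comp_of = {node: i for i, comp in enumerate(sccs) for node in comp}
    dag = {i: set() for i in range(len(sccs))}
    for src, dsts in graph.items():
        for dst in dsts:
            a, b = comp_of[src], comp_of[dst]
            if a != b:          # edges inside one SCC are dropped
                dag[a].add(b)
    return dag


# Hypothetical dependence graph and its SCCs.
graph = {
    "A": ["A", "B", "C", "E"],
    "B": ["A", "C", "D", "E"],
    "C": ["D"],
    "D": ["E"],
    "E": [],
}
sccs = [["A", "B"], ["C"], ["D"], ["E"]]
print(condense(graph, sccs))
# {0: {1, 2, 3}, 1: {2}, 2: {3}, 3: set()}
```

The result has no cycles by construction, which is what allows the later partitioning step to place components in pipeline order.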


Partition DAG

Partition DAG nodes into n partitions (n <= # of processors)

Use a heuristic to maximize load balance
• Decide # of partitions (threads)
• Start filling in from partition 1 with nodes from the top of the DAG. When the partition is full (estimated by # of cycles), move on to the next partition

Find the best # of threads and its partition
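
The filling heuristic can be sketched as follows. This is a simplified reading of the slide, not the paper's exact heuristic: it assumes the DAG nodes are already listed in topological order, that "full" means reaching an equal share of the total estimated cycles, and that the best thread count is the one with the smallest bottleneck partition. All names and cycle counts are hypothetical.

```python
def greedy_partition(nodes_in_topo_order, cycles, n):
    # Each partition aims for an equal share of the estimated cycles.
    target = sum(cycles[v] for v in nodes_in_topo_order) / n
    partitions = [[]]
    load = 0
    for v in nodes_in_topo_order:
        # Move on once the current partition is full, unless it is the last.
        if load >= target and len(partitions) < n:
            partitions.append([])
            load = 0
        partitions[-1].append(v)
        load += cycles[v]
    return partitions


def best_partition(nodes_in_topo_order, cycles, max_threads):
    best = None
    for n in range(1, max_threads + 1):
        parts = greedy_partition(nodes_in_topo_order, cycles, n)
        # The slowest stage limits the throughput of the whole pipeline.
        bottleneck = max(sum(cycles[v] for v in p) for p in parts)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, parts)
    return best[1]


order = ["AB", "C", "D", "E"]                 # SCC nodes, top of DAG first
cycles = {"AB": 6, "C": 3, "D": 1, "E": 2}    # made-up cycle estimates
print(best_partition(order, cycles, 2))  # [['AB'], ['C', 'D', 'E']]
```

Because partitions are contiguous runs of a topological order, every dependence goes from an earlier partition to a later one, which keeps the flow between threads acyclic.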


Split code and insert flows (done!)

For each partition, insert the basic blocks relevant to the SCC nodes it contains

Add code for the dependence flows between partitions
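
Which flows are needed follows directly from the partition assignment: any dependence whose source and destination land in different partitions needs a produce on one side and a consume on the other. The sketch below only lists those flows; the dependence list, the `partition_of` mapping, and the function name are hypothetical, and the actual code generation is not shown.

```python
def flows_between_partitions(dependences, partition_of):
    flows = set()
    for src, dst in dependences:
        p_src, p_dst = partition_of[src], partition_of[dst]
        if p_src != p_dst:
            # One flow per (value, destination partition): the source thread
            # produces the value once and the destination thread consumes it.
            flows.add((src, p_src, p_dst))
    return sorted(flows)


# Hypothetical dependences and a two-thread partition.
dependences = [
    ("A", "A"), ("A", "B"), ("B", "A"),
    ("A", "C"), ("A", "E"),
    ("B", "C"), ("B", "D"), ("B", "E"),
    ("C", "D"), ("D", "E"),
]
partition_of = {"A": 0, "B": 0, "C": 1, "D": 1, "E": 1}
print(flows_between_partitions(dependences, partition_of))
# [('A', 0, 1), ('B', 0, 1)]
```

In this example thread 0 sends the pointer (`A`) and the branch outcome (`B`) to thread 1, and nothing flows back.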


Result

19.4% speedup on important benchmark loops, 9.2% overall

When core bandwidth is halved
• Single-threaded code slows down by 17.1%
• DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core

Promising enabler for thread-level parallelism (TLP)?


Second Paper

A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs

William Thies, Vikram Chandrasekhar and Saman Amarasinghe

From MIT


What is the paper about?

Despite the increasing use of multiprocessors, many single-threaded… (repeated)

Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code

Let people define the pipeline stages, and learn practical dependencies at run time

…for streaming applications


Interface

Add annotations in the body of the top-level loop


Dynamic analysis

The system creates a stream graph according to annotations.

How do they find dependencies?


Dynamic analysis

Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages


Dynamic analysis

Run the application on training examples, and record every relevant store-load pair across pipeline boundaries

This gives us practical dependencies
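
The idea of recording store-load pairs across stage boundaries can be sketched with a small tracer. This is an illustration under assumptions, not the paper's tool: `DependenceRecorder`, its methods, the stage names, and the string "addresses" are all hypothetical, and a real system would instrument actual memory accesses.

```python
class DependenceRecorder:
    def __init__(self):
        self.stage = None
        self.last_writer = {}   # address -> stage that last stored to it
        self.edges = set()      # (producer stage, consumer stage)

    def enter_stage(self, name):
        self.stage = name

    def store(self, addr):
        self.last_writer[addr] = self.stage

    def load(self, addr):
        writer = self.last_writer.get(addr)
        # Only pairs that cross a stage boundary become stream graph edges.
        if writer is not None and writer != self.stage:
            self.edges.add((writer, self.stage))


# A made-up training run of a three-stage streaming loop.
rec = DependenceRecorder()
for frame in range(3):
    rec.enter_stage("read")
    rec.store("buf")
    rec.enter_stage("decode")
    rec.load("buf")
    rec.store("pixels")
    rec.enter_stage("write")
    rec.load("pixels")

print(sorted(rec.edges))  # [('decode', 'write'), ('read', 'decode')]
```

The recorded edges reflect only what the training inputs exercised, so a dependence that never occurs in training goes unnoticed.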


Interface

The program shows a complete stream graph

The user decides whether this pipelining is acceptable

• If yes, done!

• Otherwise, redo the annotations and iterate until satisfied


Actual pipelining

When compiled, the annotation macros emit code that forks the original program once for each pipeline stage
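
The fork-per-stage idea can be illustrated with two processes on a Unix-like system. This is a loose sketch, not the code the macros emit: using a pipe between the stages is an assumption made here, and `run_two_stage_pipeline` and the work done in each stage are hypothetical.

```python
import os
import struct


def run_two_stage_pipeline(items):
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child process: stage 1. It squares each item and sends it on.
        os.close(read_fd)
        with os.fdopen(write_fd, "wb") as out:
            for x in items:
                out.write(struct.pack("<q", x * x))
        os._exit(0)   # leave without running the parent's remaining code

    # Parent process: stage 2. It adds one to every value it receives.
    os.close(write_fd)
    results = []
    with os.fdopen(read_fd, "rb") as inp:
        while True:
            chunk = inp.read(8)
            if len(chunk) < 8:   # end of stream: stage 1 closed the pipe
                break
            results.append(struct.unpack("<q", chunk)[0] + 1)
    os.waitpid(pid, 0)
    return results


print(run_two_stage_pipeline([1, 2, 3]))  # [2, 5, 10]
```

Since each forked process starts as a copy of the original program, each stage sees the program state it expects, and only the values that cross a stage boundary need to be sent explicitly.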


Result

Average 2.78x speedup, max 3.89x on a 4-core machine

Seems unsound but practical (?)