1
Compiling with multicore
Jeehyung Lee
15-745 Spring 2009
2
Papers
Automatic Thread Extraction with Decoupled Software Pipelining
• Fully automatic
• Fine-grained pipelining
A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
• Semi-automatic
• Coarse-grained pipelining
3
First paper
Automatic Thread Extraction with Decoupled Software Pipelining
Guilherme Ottoni, Ram Rangan, Adam Stoler, and David August
From Princeton University
4
What is the paper about?
Despite the increasing use of multiprocessors, many single-threaded applications do not benefit
Let the compiler automatically extract threads and exploit latent pipeline parallelism
Extract non-speculative, truly decoupled threads through Decoupled Software Pipelining (DSWP)
5
Why decoupled pipelining?
Example
Linked list traversal
6
Why decoupled pipelining?
DOACROSS
Critical path: iterations * (LD latency + communication latency)
7
Why decoupled pipelining?
DSWP
Critical path: iterations * LD latency
One-way pipelining
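The two critical-path formulas on the DOACROSS and DSWP slides can be compared with a small calculation. This is only a rough model under stated assumptions: the function names and the latency values are made up for illustration, and pipeline fill time is ignored.

```python
def doacross_time(iterations, ld_latency, comm_latency):
    """Critical path when the pointer-chasing load hops between cores,
    so every iteration pays the load plus one inter-core transfer."""
    return iterations * (ld_latency + comm_latency)


def dswp_time(iterations, ld_latency):
    """Critical path when the pointer-chasing load stays on one core.
    Communication flows one way and is off the critical path
    (pipeline fill is ignored in this simplified model)."""
    return iterations * ld_latency


if __name__ == "__main__":
    # Hypothetical latencies, in cycles, chosen only for illustration.
    n, ld, comm = 100, 3, 10
    print("DOACROSS:", doacross_time(n, ld, comm))
    print("DSWP:    ", dswp_time(n, ld))
```

With these assumed numbers the communication latency dominates the DOACROSS schedule, which is the effect the slides point at.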
8
DSWP
Flow of data (dependency) is acyclic among cores
With the use of inter-core queues, threads can be decoupled
• Efficiency + high tolerance for latency
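A minimal sketch of the decoupling idea, using the linked list traversal example from the earlier slides. A bounded software queue stands in for the inter-core queue; the names `Node`, `traverse`, `work`, and `run_pipeline` are hypothetical, and squaring the value is a placeholder for the real loop body. Python threads do not give true parallel speedup, so this shows only the structure: data flows one way, from the traversal thread to the work thread.

```python
import queue
import threading


class Node:
    def __init__(self, value, next_node=None):
        self.value = value
        self.next = next_node


SENTINEL = object()  # marks the end of the stream


def traverse(head, q):
    """Stage 1: pointer chasing only. Never waits on stage 2 results."""
    node = head
    while node is not None:
        q.put(node.value)  # blocks only if the queue is full
        node = node.next
    q.put(SENTINEL)


def work(q, results):
    """Stage 2: consumes values and does the per-node work."""
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        results.append(item * item)  # placeholder for the loop body


def run_pipeline(values, queue_size=32):
    # Build the linked list front to back.
    head = None
    for v in reversed(values):
        head = Node(v, head)
    q = queue.Queue(maxsize=queue_size)
    results = []
    producer = threading.Thread(target=traverse, args=(head, q))
    consumer = threading.Thread(target=work, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

The queue lets the traversal stage run ahead of the work stage by up to `queue_size` items, which is where the latency tolerance comes from.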
9
DSWP Algorithm
• Build dependence graph
• Find strongly connected components (SCC)
• Create DAG of SCC
• Partition DAG
• Split codes into partitions
• Add flows to partitions
10
Build dependence graph
Include every traditional dependence (data, control, and memory) & extensions
11
Find SCC
SCC: instructions that form a dependency cycle in a loop
Instructions in the same SCC cannot be split across threads
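As an illustration of this step, the sketch below finds SCCs with Tarjan's algorithm. The slides do not say which SCC algorithm the paper uses, so this choice is an assumption, and the four-instruction graph in the usage example is a hypothetical dependence graph for a linked list loop.

```python
def strongly_connected_components(graph):
    """graph: dict mapping each node to a list of its successors.
    Returns a list of SCCs, each a sorted list of nodes (Tarjan's algorithm)."""
    index_of = {}
    lowlink = {}
    on_stack = set()
    stack = []
    sccs = []
    counter = 0

    def visit(v):
        nonlocal counter
        index_of[v] = lowlink[v] = counter
        counter += 1
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, []):
            if w not in index_of:
                visit(w)
                lowlink[v] = min(lowlink[v], lowlink[w])
            elif w in on_stack:
                lowlink[v] = min(lowlink[v], index_of[w])
        if lowlink[v] == index_of[v]:
            # v is the root of an SCC: pop the whole component.
            component = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                component.append(w)
                if w == v:
                    break
            sccs.append(sorted(component))

    for v in graph:
        if v not in index_of:
            visit(v)
    return sccs


# Hypothetical dependence graph: "next" is p = p->next, "test" is the
# loop-exit branch, "load" reads p->val, "add" accumulates a sum.
example = {
    "next": ["next", "test", "load"],
    "test": ["next", "load", "add"],
    "load": ["add"],
    "add": ["add"],
}
```

In this example `next` and `test` depend on each other and end up in one SCC, while `load` and `add` each form their own.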
12
Create DAG of SCCs
Merge the instructions within each SCC into a single node and update the dependency arrows
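The merge step can be sketched as follows, assuming each instruction has already been labeled with the id of its SCC. The function name `condense` and the instruction names are hypothetical.

```python
def condense(edges, scc_of):
    """edges: iterable of (src, dst) instruction dependences.
    scc_of: dict mapping each instruction to its SCC id.
    Returns a dict mapping each SCC id to the set of SCC ids it feeds."""
    dag = {scc: set() for scc in scc_of.values()}
    for src, dst in edges:
        a, b = scc_of[src], scc_of[dst]
        if a != b:  # edges inside one SCC disappear after merging
            dag[a].add(b)
    return dag


# Hypothetical dependences for a linked list loop.
edges = [
    ("next", "next"), ("next", "test"), ("test", "next"),
    ("next", "load"), ("test", "load"), ("test", "add"),
    ("load", "add"), ("add", "add"),
]
scc_of = {"next": 0, "test": 0, "load": 1, "add": 2}
```

Because every cycle lives inside a single SCC, the resulting graph has no cycles, which is what makes one-way pipelining possible.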
13
Partition DAG
Partition DAG nodes into n partitions (n <= # of processors)
Use heuristic to maximize load balance
• Decide # of partitions (threads)
• Start filling in from partition 1 with nodes from the top of DAG
• When the partition is full (estimated by # of cycles), move on to next partition
Find the best # of threads and its partition
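A simplified sketch of the filling heuristic described above. It assumes the DAG nodes are already given in a topological order and that "full" means reaching an equal share of the total estimated cycles; both are assumptions for illustration, and the paper's actual heuristic may differ. The names `greedy_partition`, `topo_nodes`, and `cycles` are hypothetical.

```python
def greedy_partition(topo_nodes, cycles, num_threads):
    """topo_nodes: DAG nodes in topological order (top of the DAG first).
    cycles: dict mapping each node to its estimated cycle count.
    Returns a list of partitions, each a list of nodes."""
    total = sum(cycles[n] for n in topo_nodes)
    target = total / num_threads  # assumed per-partition budget
    partitions = [[]]
    load = 0
    for node in topo_nodes:
        # Move on once the current partition has reached its budget,
        # unless it is already the last allowed partition.
        if load >= target and len(partitions) < num_threads:
            partitions.append([])
            load = 0
        partitions[-1].append(node)
        load += cycles[node]
    return partitions
```

Filling in topological order keeps every dependence pointing from an earlier partition to a later one. To pick the number of threads, one could call this for each candidate count and keep the most balanced result.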
14
Split code and insert flows (done!)
For each partition, insert the basic blocks relevant to the SCC nodes it contains
Add code for the dependency flows
15
Result
19.4% speedup on important benchmark loops, 9.2% overall
When core bandwidth is halved
• Single-threaded code slows down by 17.1%
• DSWP code is still slightly faster than single-threaded code running on a full-bandwidth core
Promising enabler for Thread-Level Parallelism (TLP)?
16
Second Paper
A Practical Approach to Exploring Coarse-Grained Pipeline Parallelism in C Programs
William Thies, Vikram Chandrasekhar, and Saman Amarasinghe
From MIT
17
What is the paper about?
Despite increasing uses of multiprocessors, many single threaded… (Repeated)
Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
Let people define the pipeline, and learn practical dependencies at run time
18
What is the paper about?
Despite increasing uses of multiprocessors, many single threaded… (Repeated)
Coarse-grained pipelining is more desirable, but is especially hard with obfuscated C code
Let people define stages, and learn practical dependencies at run time
…for streaming applications
19
Interface
Add annotations in the body of the top-level loop
20
Dynamic analysis
The system creates a stream graph according to annotations.
How do they find dependencies?
21
Dynamic analysis
Streaming applications tend to have a fixed pattern of dataflow (stable flow) among pipeline stages
22
Dynamic analysis
Run the application on training examples, and record every relevant store-load pair across pipeline boundaries
This gives us practical dependencies
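The recording step can be sketched on a toy memory trace. This assumes the trace is a sequence of `(op, address, stage)` events, which is a made-up format for illustration; the slides do not describe how the real system instruments the program. A dependency is recorded when a load reads an address last written by a different stage.

```python
def find_dependences(trace):
    """trace: iterable of (op, address, stage), op is "store" or "load".
    Returns a set of (producer_stage, consumer_stage, address) tuples
    observed across pipeline boundaries."""
    last_writer = {}  # address -> stage that last stored to it
    deps = set()
    for op, address, stage in trace:
        if op == "store":
            last_writer[address] = stage
        elif op == "load":
            producer = last_writer.get(address)
            # Only cross-stage pairs matter for the stream graph.
            if producer is not None and producer != stage:
                deps.add((producer, stage, address))
    return deps


# Hypothetical trace with three stages.
trace = [
    ("store", 0x10, "read"),
    ("load", 0x10, "decode"),
    ("store", 0x20, "decode"),
    ("load", 0x20, "decode"),   # same stage: not recorded
    ("load", 0x10, "output"),
    ("load", 0x30, "output"),   # never stored: not recorded
]
```

Only dependencies exercised by the training inputs show up, so the result is practical rather than guaranteed complete.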
23
Interface
The program shows a complete stream graph
The user decides whether he/she likes this pipelining
• If yes, done!
• Else, redo the annotations and iterate until satisfied
24
Actual pipelining
When compiled, the annotation macros emit code that forks the original program for each pipeline stage
25
Result
Average 2.78x speedup, max 3.89x, on 4 cores
Seems unsound but practical (?)