Automatically Tuning Task-Based Programs for Multi-core Processors
![Page 1: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/1.jpg)
Automatically Tuning Task-Based Programs for Multi-core Processors
Jin Zhou, Brian Demsky
Department of Electrical Engineering and Computer Science
University of California, Irvine
![Page 2: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/2.jpg)
Motivation
• Recent microprocessor trends
  – Number of cores increased rapidly
  – Architectures vary widely
• Challenges for software development
  – Parallelization is now key for performance
  – Current parallel programming model: threads + locks
    • Hard to develop correct and efficient parallel software
    • Hard to adapt software to changes in architectures
![Page 3: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/3.jpg)
Goals
• Automatically generate parallel implementation
• Automatically tune parallel implementation
![Page 4: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/4.jpg)
Overview
[Figure: Bamboo compiler pipeline. Inputs: Program, Processor Specification, and Profile Data. Stages: Implementation Generator (produces candidate implementations), Simulation-based Evaluator (selects leading implementations), Implementation Optimizer (produces tuned implementations), and Code Generator (turns the optimized implementation into an optimized multi-core binary for the multi-core processor).]
![Page 5: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/5.jpg)
Example
• MonteCarlo Example
  – Partitions problem into several simulations
  – Executes the simulations in parallel
  – Aggregates results of all simulations
![Page 6: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/6.jpg)
Bamboo Language
• A hybrid language that combines dataflow and Java
  – Programs are composed of tasks
  – Tasks compose with dataflow-like semantics
  – Tasks contain Java-like object-oriented code internally
  – Programs cannot explicitly invoke tasks
  – The runtime automatically invokes tasks
• Supports standard object-oriented constructs, including methods and classes
![Page 7: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/7.jpg)
Bamboo Language
• Flags
  – Capture the current role (type state) of an object in the computation
  – Each flag captures an aspect of the object’s state
  – Change as the object’s role evolves in the program
  – Support orthogonal classifications of objects
![Page 8: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/8.jpg)
    task startup(StartupObject s in initialstate) {
        Aggregator aggr = new Aggregator(s.args[0]){merge:=true};
        for(int i = 0; i < 4; i++) {
            Simulator sim = new Simulator(aggr){run:=true};
        }
        taskexit(s: initialstate:=false);
    }

    task simulate(Simulator sim in run) {
        sim.runSimulate();
        taskexit(sim: run:=false, submit:=true);
    }

    task aggregate(Aggregator aggr in merge, Simulator sim in submit) {
        boolean allprocessed = aggr.aggregateResult(sim);
        if (allprocessed)
            taskexit(aggr: merge:=false, finished:=true; sim: submit:=false, finished:=true);
        taskexit(sim: submit:=false, finished:=true);
    }

    class Aggregator { flag merge; flag finished; … }
    class Simulator { flag run; flag submit; flag finished; ... }
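The way the runtime picks tasks from flag states can be imitated in a few lines of ordinary code. The following Python sketch is only an illustration of the execution model, not the Bamboo runtime: `FlaggedObject`, `run_montecarlo`, the sequential loop, and the fixed task priority (startup, then simulate, then aggregate) are all assumptions made for this example.

```python
class FlaggedObject:
    """An object in the global space, tagged with its currently set flags."""
    def __init__(self, kind, *flags):
        self.kind = kind
        self.flags = set(flags)


def run_montecarlo(n_sims=4):
    """Sequentially invoke whichever task has parameters in matching flag states."""
    space = [FlaggedObject("StartupObject", "initialstate")]
    log = []      # names of tasks in the order they were invoked
    merged = 0    # results merged so far (stands in for Aggregator state)

    def find(kind, flag):
        # First object of the given class that has the given flag set.
        return next((o for o in space if o.kind == kind and flag in o.flags), None)

    while True:
        start = find("StartupObject", "initialstate")
        sim_run = find("Simulator", "run")
        aggr = find("Aggregator", "merge")
        sim_sub = find("Simulator", "submit")
        if start:                        # task startup
            space.append(FlaggedObject("Aggregator", "merge"))
            for _ in range(n_sims):
                space.append(FlaggedObject("Simulator", "run"))
            start.flags.discard("initialstate")
            log.append("startup")
        elif sim_run:                    # task simulate
            sim_run.flags = {"submit"}
            log.append("simulate")
        elif aggr and sim_sub:           # task aggregate
            merged += 1
            sim_sub.flags = {"finished"}
            if merged == n_sims:
                aggr.flags = {"finished"}
            log.append("aggregate")
        else:                            # no task has matching parameters
            return log
```

Calling `run_montecarlo()` returns the invocation order; no task is ever called explicitly, each one fires because some object reached the flag state named in its parameter list.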
![Page 9: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/9.jpg)
Bamboo Program Execution
[Figure: global flagged object space. Runtime initialization creates a new StartupObject in the initialstate state.]
![Page 10: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/10.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The startup task executes on the StartupObject in the initialstate state.]
![Page 11: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/11.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The startup task creates a new Aggregator (merge state) and four new Simulators (run state), then sets the StartupObject from initialstate to finished.]
![Page 12: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/12.jpg)
Bamboo Program Execution
[Figure: global flagged object space. A simulate task executes on each of the four Simulators in the run state.]
![Page 13: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/13.jpg)
Bamboo Program Execution
[Figure: global flagged object space. Each simulate task sets its Simulator from run to submit.]
![Page 14: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/14.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task executes on the Aggregator (merge state) and one Simulator (submit state).]
![Page 15: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/15.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task sets the flags of the Simulator it processed (submit to finished).]
![Page 16: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/16.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task executes on the Aggregator and the next Simulator in the submit state.]
![Page 17: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/17.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task sets the flags of the Simulator it processed (submit to finished).]
![Page 18: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/18.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task executes on the Aggregator and the next Simulator in the submit state.]
![Page 19: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/19.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task sets the flags of the Simulator it processed (submit to finished).]
![Page 20: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/20.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task executes on the Aggregator and the last Simulator in the submit state.]
![Page 21: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/21.jpg)
Bamboo Program Execution
[Figure: global flagged object space. The aggregate task sets flags after processing the last Simulator.]
![Page 22: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/22.jpg)
Implementation Generation
[Figure: front end of the Bamboo compiler. The Implementation Generator takes the Bamboo Program, the Processor Specification, and Profile Data, and produces candidate implementations.]
![Page 23: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/23.jpg)
Implementation Generation
• Dependence Analysis: analyzes data dependence between tasks
• Parallelism Exploration: extracts potential parallelism
• Mapping to Cores: maps the program to real processor
![Page 24: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/24.jpg)
Flag State Transition Graph (FSTG)
[Figure: FSTG of the Simulator class. run to submit via simulate (32 Mcyc, 100%); submit to finished via aggregate (2 Mcyc, 100%).]
![Page 25: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/25.jpg)
Combined Flag State Transition Graph (CFSTG)
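[Figure: CFSTG. StartupObject: initialstate to finished via startup (3 Mcyc, 100%). Aggregator: merge to merge via aggregate (2 Mcyc, 75%); merge to finished via aggregate (2 Mcyc, 25%). Simulator: run to submit via simulate (32 Mcyc, 100%); submit to finished via aggregate (2 Mcyc, 100%). Edges leaving startup are labeled with the number of new objects: 1 Aggregator and 4 Simulators.]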
![Page 26: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/26.jpg)
Initial Mapping
[Figure: the CFSTG of StartupObject, Aggregator, and Simulator, with its task costs and new-object counts (1 Aggregator, 4 Simulators), annotated with core groups.]
![Page 27: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/27.jpg)
Preprocessing Phase
• Identifies strongly connected components (SCC) and merges them into a single core group
• Converts CFSTG into a tree of core groups by replicating core groups as necessary
![Page 28: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/28.jpg)
Data Locality Rule
• Default rule
• Maximize data locality to improve performance
  – Minimizes inter-core communications
  – Improves cache behavior
[Figure: the CFSTG and the resulting core groups. StartupObject creates 1 Aggregator and 4 Simulators.]
![Page 29: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/29.jpg)
Data Parallelization Rule
• To explore potential data parallelism
[Figure: the single Simulator core group that receives 4 new Simulators is replicated into four Simulator core groups, each receiving 1 new Simulator; the Aggregator core group still receives 1 new Aggregator.]
![Page 30: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/30.jpg)
Rate Matching Rule
• If the producer executes multiple times in a cycle, how many consumers are required?
• Match two rates to estimate the number of consumers
  – Peak new object creation rate
  – Object consumption rate
[Figure: a Producer core group (init, produce) feeding multiple Consumer core groups (run).]
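The slide does not give the matching formula. One plausible reading, sketched below in Python, is to divide the peak creation rate by the rate at which a single consumer drains objects and round up. The function name, the parameter names, and the integer Mcyc units are assumptions for illustration only.

```python
def consumers_needed(objects_per_cycle, producer_cycle_time, consumer_task_time):
    """Estimate how many consumer core groups keep up with one producer.

    objects_per_cycle   -- new objects the producer creates per cycle (assumed input)
    producer_cycle_time -- length of one producer cycle in Mcyc
    consumer_task_time  -- time one consumer spends on one object in Mcyc
    """
    # creation rate    = objects_per_cycle / producer_cycle_time
    # consumption rate = 1 / consumer_task_time  (per consumer)
    # consumers        = ceil(creation rate / consumption rate)
    work = objects_per_cycle * consumer_task_time
    consumers = -(-work // producer_cycle_time)   # integer ceiling division
    return max(1, consumers)
```

For example, a producer that creates 4 objects every 32 Mcyc, feeding a consumer task that takes 32 Mcyc per object, would need 4 consumers under this reading.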
![Page 31: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/31.jpg)
Mapping to Processor
• Constraint: limited cores
• Map CFSTG core groups to physical cores
• Extended CFSTG
[Figure: the extended CFSTG (StartupObject, Aggregator, and four Simulator core groups, each edge labeled 1) to be mapped onto Core 1 and Core 2.]
![Page 32: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/32.jpg)
Mapping to Cores
• One possible mapping
[Figure: the extended CFSTG (StartupObject, Aggregator, and four Simulator core groups) with its core groups divided between Core 1 and Core 2.]
![Page 33: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/33.jpg)
Mapping to Cores
• Isomorphic mappings: have same performance
• Backtracking-based search: to generate non-isomorphic implementations
[Figure: two mappings of the extended CFSTG onto Core 1 and Core 2 that differ only in which core is labeled Core 1 and which Core 2.]
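A backtracking search that skips isomorphic mappings can be sketched as follows. This Python example is an assumption-laden illustration, not the Bamboo compiler's algorithm: it treats all cores as identical and only removes duplicates caused by renaming cores. It does not exploit the fact that the replicated Simulator core groups are themselves interchangeable.

```python
def non_isomorphic_mappings(n_groups, n_cores):
    """Yield one mapping (tuple: group index -> core index) per class of
    mappings that differ only by renaming identical cores."""
    mapping = []

    def extend(used):
        # `used` is the number of cores that already host at least one group.
        if len(mapping) == n_groups:
            yield tuple(mapping)
            return
        # A group may join any core already in use or open the next unused
        # core; never skipping ahead is what rules out renamed duplicates.
        for core in range(min(used + 1, n_cores)):
            mapping.append(core)
            yield from extend(max(used, core + 1))
            mapping.pop()   # backtrack

    yield from extend(0)
```

With the six core groups of the MonteCarlo example and two cores, `list(non_isomorphic_mappings(6, 2))` contains 32 mappings instead of the 64 that plain enumeration would produce.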
![Page 34: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/34.jpg)
Implementation Generation
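[Figure: middle of the Bamboo compiler. Candidate implementations enter the Simulation-based Evaluator, which passes leading implementations to the Implementation Optimizer; the optimizer returns tuned implementations, and the result is the optimized implementation.]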
![Page 35: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/35.jpg)
Simulation-Based Evaluation
• To select the best candidate implementation
• High-level simulation
  – Does NOT actually execute the program
  – Constructs abstract execution trace with similar statistics
  – Compares the execution time or throughput and core usage
[Figure: the simulator models cores, each executing a sequence of tasks.]
![Page 36: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/36.jpg)
Simulation-Based Evaluation
• Markov model
  – Built from profile data
  – For each task estimates:
    • The destination state
    • The execution time
    • A count of each type of new objects
[Figure: extended CFSTG annotated with profile data. StartupObject: initialstate to finished via startup (3 Mcyc, 100%), creating 1 Aggregator and 1 Simulator for each of the four Simulator core groups. Aggregator: merge to merge via aggregate (2 Mcyc, 75%); merge to finished via aggregate (2 Mcyc, 25%). Simulator: run to submit via simulate (32 Mcyc, 100%); submit to finished via aggregate (2 Mcyc, 100%).]
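To make the Markov model concrete, the profile annotations from the figure can be stored as per-state transition lists and used to compute the expected execution time an object accumulates before it reaches a state with no outgoing task. This Python sketch is an illustration under assumptions: the dictionary layout, the names `AGGREGATOR`, `SIMULATOR`, and `expected_cycles`, and the fixed-point iteration are not taken from the slides.

```python
# state -> list of (task cost in Mcyc, probability, destination state)
AGGREGATOR = {
    "merge": [(2, 0.75, "merge"), (2, 0.25, "finished")],
    "finished": [],
}
SIMULATOR = {
    "run": [(32, 1.0, "submit")],
    "submit": [(2, 1.0, "finished")],
    "finished": [],
}


def expected_cycles(model, start, sweeps=1000):
    """Expected Mcyc of task execution for an object that starts in `start`."""
    expect = {state: 0.0 for state in model}
    # Fixed-point iteration on E[s] = sum over edges of p * (cost + E[next]).
    for _ in range(sweeps):
        for state, edges in model.items():
            expect[state] = sum(p * (cost + expect[nxt]) for cost, p, nxt in edges)
    return expect[start]
```

Under this model an Aggregator in the merge state is expected to take part in four aggregate invocations (8 Mcyc in total), which agrees with the four Simulators created by the startup task.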
![Page 37: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/37.jpg)
Simulated Execution Trace
[Figure: simulated execution trace for two cores (core 0 and core 1). Each entry is a time stamp followed by the objects present; arrows mark a Simulator being transferred between cores.]
0 StartupObject(1)
3 Aggregator(1), Simulator(4)
4 Simulator(1) (transfer a Simulator)
35 Aggregator(1), Simulator(1), Simulator(2)
36 Simulator(1)
37 Aggregator(1), Simulator(2), Simulator(2) (transfer a Simulator)
67 Aggregator(1), Simulator(3), Simulator(1)
99 Aggregator(1), Simulator(4)
101 Aggregator(1), Simulator(3)
103 Aggregator(1), Simulator(2)
105 Aggregator(1), Simulator(1)
107 empty
Callout on "Aggregator(1), Simulator(4)": 1 Aggregator in the initial state and 4 Simulators in the submit state
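The end time of such a trace can be estimated without running the program. The Python sketch below is a deliberately small stand-in for the high-level simulator, under assumptions read off the trace rather than stated on the slide: transferring a Simulator costs 1 Mcyc, core 0 runs all of its own simulate tasks before any aggregate task, and aggregate tasks run one at a time in arrival order. The task costs come from the profile annotations; `estimate_finish_time` is a hypothetical name.

```python
# Task costs in Mcyc, taken from the profile annotations on the slides.
STARTUP, SIMULATE, AGGREGATE = 3, 32, 2
TRANSFER = 1  # assumed cost of moving one Simulator between cores


def estimate_finish_time(local_sims, remote_sims):
    """Estimate when the last aggregate task finishes, in Mcyc.

    local_sims  -- Simulators kept on core 0 together with the Aggregator
    remote_sims -- Simulators shipped to core 1
    """
    ready = []  # times at which submitted Simulators are available on core 0
    # Core 0 runs startup, then its own simulate tasks back to back.
    t0 = STARTUP
    for _ in range(local_sims):
        t0 += SIMULATE
        ready.append(t0)
    # Core 1 receives its Simulators after one transfer delay and sends each
    # finished Simulator back after another.
    t1 = STARTUP + TRANSFER
    for _ in range(remote_sims):
        t1 += SIMULATE
        ready.append(t1 + TRANSFER)
    # Core 0 then aggregates the results one at a time, in arrival order.
    for arrival in sorted(ready):
        t0 = max(t0, arrival) + AGGREGATE
    return t0
```

With three Simulators on core 0 and one on core 1, the estimate is 107 Mcyc, the time stamp at which the trace above becomes empty. A two-and-two split gives 75 Mcyc under the same assumptions, which is the kind of difference the evaluator uses to rank candidates.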
![Page 38: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/38.jpg)
Problem of Exhaustive Searching
• The search space expands quickly
• Exhaustive search is not feasible for complicated applications

| Number of CFSTG Core Groups | Number of Cores | Number of Candidates |
|---|---|---|
| 32 | 16 | > 6,000 |
| 64 | 32 | > 14,000,000 |
![Page 39: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/39.jpg)
Random Search?
• Very low chance to find the best implementation
[Figure: chance to find the best implementation.]
![Page 40: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/40.jpg)
Developer Optimization Process
• Create an initial implementation
• Evaluate it and identify performance bottlenecks
• Heuristically develop new implementations to remove bottlenecks
• Iteratively repeat evaluation and optimization
![Page 41: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/41.jpg)
Directed Simulated Annealing (DSA)
[Figure: the DSA loop. Components: randomly generate candidate implementations; High-level Simulator; leading candidate implementations; As-built Critical Path Analysis; potential bottlenecks; Implementation Generator; new candidate implementations, which are fed back to the simulator. Output: tuned candidate implementation.]
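The shape of this loop (random starting candidates, evaluation, a move aimed at the current bottleneck, repeat) can be shown on a toy problem. The Python sketch below is not the authors' DSA: it omits the probabilistic acceptance of simulated annealing, replaces the high-level simulator with a simple busiest-core cost, and replaces critical path analysis with "move one task off the busiest core". All names are hypothetical.

```python
import random


def core_loads(assignment, durations, n_cores):
    """Total task time placed on each core by an assignment (task -> core)."""
    loads = [0] * n_cores
    for task, core in enumerate(assignment):
        loads[core] += durations[task]
    return loads


def cost(assignment, durations, n_cores):
    """Stand-in for the high-level simulator: load of the busiest core."""
    return max(core_loads(assignment, durations, n_cores))


def directed_move(assignment, durations, n_cores):
    """Stand-in for bottleneck removal: shift one task from the busiest core
    to the least-loaded core."""
    loads = core_loads(assignment, durations, n_cores)
    busiest, idlest = loads.index(max(loads)), loads.index(min(loads))
    moved = list(assignment)
    moved[assignment.index(busiest)] = idlest
    return tuple(moved)


def directed_search(durations, n_cores, n_start=4, rounds=3, seed=0):
    rng = random.Random(seed)
    # Randomly generate the initial candidate implementations.
    candidates = [tuple(rng.randrange(n_cores) for _ in durations)
                  for _ in range(n_start)]
    best = min(candidates, key=lambda a: cost(a, durations, n_cores))
    for _ in range(rounds):
        # Derive new candidates by attacking each candidate's bottleneck.
        candidates = [directed_move(a, durations, n_cores) for a in candidates]
        leader = min(candidates, key=lambda a: cost(a, durations, n_cores))
        if cost(leader, durations, n_cores) < cost(best, durations, n_cores):
            best = leader
    return best
```

For four 32 Mcyc simulate tasks on two cores, `directed_search([32] * 4, 2)` ends with two tasks on each core, whatever the random starting candidates were.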
![Page 42: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/42.jpg)
As-Built Critical Path (ABCP)
• Provides post-mortem analysis in project management
[Figure: the extended CFSTG (StartupObject, Aggregator, and four Simulator core groups) next to the two-core simulated execution trace (time stamps 0 to 107) from the Simulated Execution Trace slide.]
![Page 43: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/43.jpg)
As-Built Critical Path Analysis
• Compute the time when a task invocation’s data dependences are resolved
[Figure: the two-core simulated execution trace (time stamps 0 to 107) from the Simulated Execution Trace slide, used to compute when each task invocation's data dependences are resolved.]
![Page 44: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/44.jpg)
Waiting Task Optimization
• Waiting tasks:
  – Tasks whose real invocation time is later than the time when all their data dependences are resolved
  – Delayed because of resource conflicts
  – Bottlenecks, remove them from ABCP
• Optimization
  – Migrate waiting tasks to spare cores
  – Shorten the ABCP to improve performance
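The definition of a waiting task translates directly into a comparison of two time stamps per invocation. The Python sketch below assumes a trace is available as a list of records; the `Invocation` type and its fields are hypothetical, and the sample values are illustrative readings of the two-core trace rather than data from the slides.

```python
from dataclasses import dataclass


@dataclass
class Invocation:
    task: str
    core: int
    ready: int   # time when all data dependences were resolved (Mcyc)
    start: int   # time when the task actually began executing (Mcyc)


def waiting_tasks(trace):
    """Invocations that started later than their dependences allowed."""
    return [inv for inv in trace if inv.start > inv.ready]


# Illustrative values: the first aggregate could have run at 35 Mcyc but
# core 0 was busy with simulate tasks until 99 Mcyc.
trace = [
    Invocation("startup", 0, ready=0, start=0),
    Invocation("simulate", 0, ready=3, start=3),
    Invocation("simulate", 0, ready=3, start=35),
    Invocation("aggregate", 0, ready=35, start=99),
]
delayed = waiting_tasks(trace)
```

Here `delayed` holds the second simulate and the aggregate invocation; these are the candidates for migration to a spare core.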
![Page 45: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/45.jpg)
Critical Task Optimization
• There may not exist spare cores to move waiting tasks to
• Identify critical tasks: tasks that produce data that is consumed immediately
• Attempt to execute critical tasks as early as possible
• Migrate other tasks which blocked some critical task to other cores
[Figure: excerpt of the two-core simulated execution trace (time stamps 35 to 101) with two numbered annotations, 1 and 2, and a Simulator(2) entry.]
![Page 46: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/46.jpg)
Code Generator
[Figure: back end of the Bamboo compiler. The Code Generator turns the optimized implementation into intermediate C code, from which the optimized multi-core binary is built.]
![Page 47: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/47.jpg)
Evaluation
• MIT RAW simulator
  – Cycle-accurate simulator configured for 16 cores
  – RAW chip: tiled chip, shared memory, on-chip network
• Benchmarks:
  – Series: Java Grande benchmark suite
  – MonteCarlo: Java Grande benchmark suite
  – FilterBank: StreamIt benchmark suite
  – Fractal
![Page 48: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/48.jpg)
Speedups on 16 cores
| Benchmark | 1-Core Bamboo (10^6 cyc) | 16-Core Bamboo (10^6 cyc) | Speedup to 1-Core Bamboo |
|---|---|---|---|
| Series | 26.4 | 1.8 | 14.7 |
| Fractal | 38.4 | 3.3 | 11.6 |
| MonteCarlo | 191.7 | 19.0 | 10.1 |
| FilterBank | 91.2 | 6.7 | 13.6 |
• Successfully generated implementations with good performance
![Page 49: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/49.jpg)
Comparison to Hand-Written C Code

| Benchmark | 1-Core C (10^6 cyc) | 1-Core Bamboo (10^6 cyc) | 16-Core Bamboo (10^6 cyc) | Speedup to 1-Core C | Overhead of Bamboo |
|---|---|---|---|---|---|
| Series | 25.0 | 26.4 | 1.8 | 13.9 | 5.6% |
| Fractal | 36.2 | 38.4 | 3.3 | 11.0 | 6.1% |
| MonteCarlo | 138.8 | 191.7 | 19.0 | 7.3 | 38.1% |
| FilterBank | 71.1 | 91.2 | 6.7 | 10.6 | 28.3% |

• Overhead of Bamboo:
  – Small for Series and Fractal
  – Larger overhead for MonteCarlo and FilterBank:
    • GCC cannot reorder instructions to fill floating-point delay slots for Bamboo implementations due to imprecise alias results
    • Easy to add alias information to facilitate the reordering
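The two derived columns of this comparison can be recomputed from the cycle counts. The formulas in the Python sketch below are inferred from the numbers rather than stated on the slide: speedup is taken as 1-core C cycles divided by 16-core Bamboo cycles, and overhead as the relative increase of 1-core Bamboo over 1-core C. The names `CYCLES`, `speedup_to_c`, and `overhead_percent` are introduced for illustration.

```python
# (1-core C, 1-core Bamboo, 16-core Bamboo) clock cycles in 10^6 cycles.
CYCLES = {
    "Series": (25.0, 26.4, 1.8),
    "Fractal": (36.2, 38.4, 3.3),
    "MonteCarlo": (138.8, 191.7, 19.0),
    "FilterBank": (71.1, 91.2, 6.7),
}


def speedup_to_c(c1, bamboo1, bamboo16):
    """Speedup of the 16-core Bamboo binary over the 1-core C version."""
    return c1 / bamboo16


def overhead_percent(c1, bamboo1, bamboo16):
    """Extra cycles of 1-core Bamboo relative to 1-core C, in percent."""
    return (bamboo1 - c1) / c1 * 100


for name, row in CYCLES.items():
    print(f"{name}: speedup {speedup_to_c(*row):.1f}, "
          f"overhead {overhead_percent(*row):.1f}%")
```

Rounded to one decimal place, the printed values match the Speedup and Overhead figures reported for all four benchmarks.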
![Page 50: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/50.jpg)
Comparison of Estimation and Real Execution
• The simulation estimations are close to the real execution time
| Benchmark | 1-Core Estimation (10^6 cyc) | 1-Core Real (10^6 cyc) | 1-Core Error | 16-Core Estimation (10^6 cyc) | 16-Core Real (10^6 cyc) | 16-Core Error |
|---|---|---|---|---|---|---|
| Series | 26.3 | 26.4 | 0.38% | 1.7 | 1.8 | 5.56% |
| Fractal | 38.4 | 38.4 | 0% | 3.1 | 3.3 | 6.06% |
| MonteCarlo | 191.0 | 191.7 | 0.37% | 18.3 | 19.0 | 3.68% |
| FilterBank | 91.2 | 91.2 | 0% | 6.5 | 6.7 | 2.99% |
![Page 51: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/51.jpg)
Optimality of Directed Simulated Annealing
![Page 52: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/52.jpg)
Fractal
![Page 53: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/53.jpg)
MonteCarlo
![Page 54: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/54.jpg)
FilterBank
![Page 55: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/55.jpg)
Generality of Synthesized Implementation
• The speedups of both 16-core Bamboo versions are similar
• Successfully generate a sophisticated implementation utilizing pipelining for MonteCarlo
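| Benchmark | 1-Core (10^6 cyc) | Profile_original, Input_double: 16-Core (10^6 cyc) | Profile_original, Input_double: Speedup | Profile_double, Input_double: 16-Core (10^6 cyc) | Profile_double, Input_double: Speedup |
|---|---|---|---|---|---|
| Series | 54.2 | 3.6 | 15.1 | 3.6 | 15.1 |
| Fractal | 76.6 | 6.5 | 11.8 | 6.5 | 11.8 |
| MonteCarlo | 383.2 | 37.8 | 10.1 | 35.7 | 10.7 |
| FilterBank | 182.3 | 13.3 | 13.7 | 13.3 | 13.7 |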
![Page 56: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/56.jpg)
Related Work
• Data-flow and streaming languages:
  – Bamboo relaxes typical restrictions in these models to permit:
    • Flexible mutation of data structures
    • Data structures of arbitrarily complex constructs
  – Bamboo supports applications that non-deterministically access data
• Tuple-space language: compiler cannot automatically create multiple instantiations to utilize multiple cores
• Self-tuning libraries: mostly address specific computations
![Page 57: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/57.jpg)
Conclusion
• We developed a new approach to automatically tune task-based programs for multi-core processors
  – Automatically generate parallel implementations
  – Automatically tune according to specific architecture
• The approach was evaluated on the MIT RAW simulator
  – Successfully generated implementations with good performance
  – Successfully generated a sophisticated implementation utilizing pipelining
• Can be extended to the broader context of traditional programming languages
![Page 58: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/58.jpg)
Thank you!
![Page 59: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/59.jpg)
Future Work
• Apply our approach to non-simulated multi-core processors
• Develop a more sophisticated processor specification
• Explore a richer set of applications
![Page 60: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/60.jpg)
Design Rationale
• Why not dynamic scheduling?
  – Scales poorly as the number of cores increases
  – Our basic approach makes it easier to adapt to future changes in architectures
![Page 61: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/61.jpg)
Tree Transform
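[Figure: tree transform. A Consumer core group fed by both Producer1 and Producer2 (edges labeled with new-object counts 1 and 4) is replicated so that each producer has its own Consumer core group and the graph becomes a tree.]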
![Page 62: Automatically Tuning Task-Based Programs for Multi-core Processors](https://reader035.vdocuments.mx/reader035/viewer/2022062519/56814d62550346895dbaae86/html5/thumbnails/62.jpg)
Tags
• Motivation: consider a video processor example
• Tags group objects together:
  – Tags have types
  – Can create many instances of a tag type
  – Each instance defines a group
• Can bind tag instances to objects
• Tags can specify that task parameters must be in the same group