A Survey on Scheduling Methods of Task-Parallel Processing
Chikayama and Taura Lab, M1 48-096415, Jun Nakashima
Agenda
• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary
Motivation
• Threads and tasks have much in common
  – Both are units of execution
  – Multiple threads/tasks may be executed simultaneously
• Scheduling methods for tasks can therefore be useful for threads as well
Background
• Demand for exploiting dynamic and irregular parallelism
• Simple parallelization (pthreads, OpenMP, …) is not efficient
  – Few threads: load balancing is difficult
  – Many threads: good load balance, but the overhead is unbearable
• Examples:
  – N-Queens puzzle
  – Strassen's algorithm (matrix-matrix product)
  – LU factorization of a sparse matrix
Task-Parallel Processing
• Decompose the entire computation into tasks and execute them in parallel
  – Task: a unit of execution much lighter than a thread
  – Fairness among tasks is not considered
    • Tasks may be deferred or suspended
• Representation of dependences
  – Task creation by a task
  – Waiting for child tasks
• Programming environments with task support:
  – Cilk, X10, Intel TBB, OpenMP (3.0 and later), etc.
Task-Parallel Processing(2)
A simple example:

task task_fib(n) {
  if (n <= 1) return 1;
  t1 = create_task(task_fib(n-2)); // create child tasks
  t2 = create_task(task_fib(n-1));
  ret1 = task_wait(t1);            // wait for the children
  ret2 = task_wait(t2);
  return ret1 + ret2;
}

[Task graph: fib(n) spawns fib(n-1) and fib(n-2), which recursively spawn fib(n-2), fib(n-3), fib(n-3), fib(n-4), …; tasks of the same color (independent subtrees) can be executed in parallel]
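The pseudocode above can be sketched in runnable form. Here one Python thread stands in for each task; this is only an illustration of the create/wait interface (a real task runtime would use much lighter tasks than OS threads):

```python
import threading

class Task:
    """Minimal stand-in for create_task/task_wait from the pseudocode."""
    def __init__(self, fn, *args):
        self.result = None
        def run():
            self.result = fn(*args)
        self.thread = threading.Thread(target=run)
        self.thread.start()            # "create_task"

    def wait(self):                    # "task_wait"
        self.thread.join()
        return self.result

def task_fib(n):
    if n <= 1:
        return 1
    t1 = Task(task_fib, n - 2)         # create child tasks
    t2 = Task(task_fib, n - 1)
    return t1.wait() + t2.wait()       # wait for both children

print(task_fib(10))  # 89
```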
Basic execution model
• Fork threads up to the number of CPU cores
  – Each thread has a queue of tasks
  – Each task is assigned to and executed by one thread
[Figure: the fib task graph partitioned between Thread 1 and Thread 2, each holding part of the graph in its own task queue]
Agenda
• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary
Basic scheduling strategy : Breadth-First and Work-first
Breadth-First
• At task creation:
  – Enqueue the new task
  – Execute a child only when the parent task suspends

Work-First
• At task creation:
  – The parent task always suspends, and the child runs immediately
  – The parent continues when the child task finishes
[Figure: queue states under each strategy; breadth-first keeps the children ready while the parent keeps running, while work-first runs the child while the parent waits]
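The difference between the two strategies can be illustrated with a toy scheduler that only records execution order (the task tree `A → {B → {D}, C}` is a made-up example):

```python
from collections import deque

def breadth_first(root_tasks):
    """Enqueue children; run them only after the parent has run."""
    order, queue = [], deque(root_tasks)
    while queue:
        name, children = queue.popleft()
        order.append(name)
        queue.extend(children)      # new tasks go to the back of the queue
    return order

def work_first(task):
    """Suspend the parent and run each child immediately (depth-first)."""
    name, children = task
    order = [name]
    for child in children:
        order.extend(work_first(child))
    return order

# toy task tree: A spawns B and C; B spawns D
tree = ("A", [("B", [("D", [])]), ("C", [])])
print(breadth_first([tree]))  # ['A', 'B', 'C', 'D']
print(work_first(tree))       # ['A', 'B', 'D', 'C']
```

Note how work-first dives into each child's subtree before returning to its siblings, while breadth-first visits tasks level by level.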
Work stealing
• A load-balancing technique for work-first schedulers
  – Idle threads steal runnable tasks from other threads
• Basic strategy: FIFO
  – Steal the oldest task in the victim's task queue
  – The victim thread should be chosen at random
[Figure: an idle thread sends a steal request and takes the oldest ready task from the victim thread's queue]
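A sketch of this FIFO stealing policy in Python (the worker/deque layout and the task names are illustrative; a real implementation would need synchronization around the steal):

```python
import random
from collections import deque

# Each worker pushes and pops its own work at the back of its deque;
# an idle worker steals the OLDEST task from the front of a randomly
# chosen victim's deque, as described above.
class Worker:
    def __init__(self):
        self.tasks = deque()

    def push(self, task):
        self.tasks.append(task)                 # own work: newest end

    def pop(self):
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victims):
        for victim in random.sample(victims, len(victims)):
            if victim.tasks:
                return victim.tasks.popleft()   # oldest task: FIFO end
        return None

w1, w2 = Worker(), Worker()
for t in ["fib(n-2)", "fib(n-3)", "fib(n-4)"]:
    w1.push(t)
stolen = w2.steal_from([w1])
print(stolen)  # 'fib(n-2)' — the oldest task in w1's queue
```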
Effect of Work Stealing
• An old task tends to create many tasks in the future
  – Especially with recursive parallelism
[Figure: in the task graph of the previous page, stealing the old task fib(n-1) gives the thief a whole subtree of future work]
Lazy Task Creation
• Save the continuation of the parent task instead of creating a child task
  – A continuation is lighter than a task
• On a steal, create a task from the continuation and steal that
[Figure: the thread keeps continuations (≠ tasks) in its queue; when a steal request arrives, a task is created from a continuation and handed to the thief]
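The idea can be sketched as follows. Here a "continuation" is simplified to a cheap (function, arguments) record; a full task object is materialized only when the work is actually stolen (this is a simplification of real lazy task creation, which saves the parent's actual continuation):

```python
# Toy lazy task creation: spawning pushes a lightweight record instead
# of building a full task; a runnable task is created only on a steal.
class LazyWorker:
    def __init__(self):
        self.continuations = []              # cheap records, not tasks

    def spawn(self, fn, *args):
        self.continuations.append((fn, args))

    def steal(self):
        if not self.continuations:
            return None
        fn, args = self.continuations.pop(0)  # oldest continuation
        return lambda: fn(*args)              # materialize a task lazily

w = LazyWorker()
w.spawn(lambda n: n * n, 7)
task = w.steal()          # only now does a runnable task exist
print(task())  # 49
```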
Cut-off
• Execute the child sequentially instead of creating a task
  – To avoid too-fine-grained tasks
• Basic cut-off criteria:
  – Number of tasks
  – Recursion depth
[Figure: below a certain depth of the fib task graph, subtrees such as fib(n-3) and fib(n-4) are executed serially]
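A depth-based cut-off for the fib example might look like the sketch below (the threshold value 3 and the use of Python threads as tasks are illustrative choices, not part of the surveyed systems):

```python
import threading

# Spawn parallel tasks only near the top of the recursion tree; below
# CUTOFF_DEPTH, recurse serially to avoid the overhead of tiny tasks.
CUTOFF_DEPTH = 3

def fib(n, depth=0):
    if n <= 1:
        return 1
    if depth >= CUTOFF_DEPTH:              # cut off: serial execution
        return fib(n - 2, depth + 1) + fib(n - 1, depth + 1)
    result = []
    t = threading.Thread(
        target=lambda: result.append(fib(n - 2, depth + 1)))
    t.start()                              # parallel child task
    ret2 = fib(n - 1, depth + 1)           # other child runs here
    t.join()
    return result[0] + ret2

print(fib(10))  # 89
```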
Agenda
• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary
Challenges
• Architecture-aware scheduling
• Scalable implementation
• Determination of cut-off threshold
Architecture-aware scheduling
• The basic methods take no account of the architecture
• On some architectures, performance is degraded
• Example: NUMA architectures
NUMA Architecture
• NUMA = Non-Uniform Memory Access
• Memory access cost depends on the CPU core and the address
• Considering locality is very important!
[Figure: cores and memories connected by an interconnect; local memory access is fast, remote memory access is slow]
A bad case on NUMA
• A thread steals a task whose data lives in a remote CPU's memory
• The result is more remote memory accesses
[Figure: the stolen task runs on a core far from its data, so its accesses must cross the interconnect]
Affinity Bubble-Scheduler
• "Scheduling Dynamic OpenMP Applications over Multicore Architectures" (Broquedis et al.)
• A locality-aware thread scheduler
• Based on BubbleSched:
  – A framework for implementing schedulers on hierarchical architectures
  – Threads are grouped into bubbles
  – The scheduler uses bubbles as hints
What is bubble?
• A group of tasks and (nested) bubbles
  – Describes the affinities among tasks
• Created by calling a library function
• Grouped tasks share data
[Figure: a bubble grouping several tasks, with a nested bubble inside]
Initial task distribution
• Explode bubbles hierarchically
[Figure: the root bubble is exploded and its contents divided to balance load; an inner bubble is exploded to distribute its tasks across two CPU cores]
NUMA-aware Work Stealing
• Idle threads steal tasks from as local a thread as possible
[Figure: an idle core steals from the nearest core first]
Challenges
• Architecture-aware scheduling
  – Affinity Bubble-Scheduler
• Scalable implementation
• Determination of cut-off threshold
Scalable implementation
• When operating on task queues, threads have to acquire a lock
  – Because a task queue may be accessed by multiple threads
• Task queue operations occur at every task creation and completion
• Locks may become a serious bottleneck!
24
Thread
task
Thread
task
task
Steal request
Finished!Need to lock the entire queue
Steal request
A simple way to decrease locks
• Two task queues per thread
  – One local, one public
• Tasks are stolen only from the public queue
• The local queue is lock-free
[Figure: only the public queue needs locking; operations on the local queue are lock-free]
Problem of double task queue
• When a task is moved between the two queues, a memory copy is required
[Figure: moving a task from the local queue to the public queue copies the task]
Split Task Queues
• "Scalable Work Stealing" (Dinan et al.)
• Split a single task queue with a "split pointer"
  – From the head to the split pointer: local portion
  – From the split pointer to the tail: public portion
[Figure: one queue per thread divided by the split pointer; the local portion is lock-free]
Split Task Queues
• Move the pointer toward the head if the public portion becomes empty
  – This operation is lock-free
• Move the pointer toward the tail if the local portion becomes empty
• No task copy is required
[Figure: the split pointer slides along the queue, so tasks change portions without being moved in memory]
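The key property of the split queue is that a task changes portion by moving the pointer, never by copying the task. A toy sketch (the portion layout, method names, and task names are illustrative simplifications; the real structure in Dinan et al. also needs locking around the public portion):

```python
# One array per thread, divided by a split pointer: indices [0, split)
# are public (visible to thieves), [split, len) are local to the owner.
class SplitQueue:
    def __init__(self, tasks):
        self.tasks = list(tasks)
        self.split = 0                 # everything starts local

    def release(self, k=1):
        """Make k local tasks public by moving the pointer (no copy)."""
        self.split = min(self.split + k, len(self.tasks))

    def steal(self):
        """Thieves take the oldest public task (locked in real code)."""
        if self.split == 0:
            return None
        task = self.tasks.pop(0)
        self.split -= 1
        return task

    def pop_local(self):
        """The owner pops from the local portion, lock-free."""
        if self.split < len(self.tasks):
            return self.tasks.pop()    # newest local task
        return None

q = SplitQueue(["t1", "t2", "t3"])
q.release()                # t1 becomes public — no memory copy
print(q.steal())           # 't1'
print(q.pop_local())       # 't3'
```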
And more…
• Also from "Scalable Work Stealing" (Dinan et al.):
• Efficient task creation
  – Initialize the task queue entry directly
• A better amount of work to steal
  – Half of the public queue
Challenges
• Architecture-aware scheduling
  – Affinity Bubble-Scheduler
• Scalable implementation
  – Split Task Queues
• Determination of cut-off threshold
Determination of cut-off threshold
• An appropriate cut-off threshold cannot be determined simply
  – It depends on the algorithm, the scheduling method, and the input data
• Too large: tasks become too coarse-grained
  – Leads to load imbalance
• Too small: tasks become too fine-grained
  – Large overhead
Profile-based cut-off determination
• "An Adaptive Cut-Off for Task Parallelism" (Duran et al.)
• Uses two profiling modes:
  – Full Mode
  – Minimal Mode
• Estimates task execution times and decides whether to apply the cut-off
Full Mode
• Measures every task's execution time
  – Heavy overhead
  – Complete information
[Figure: execution times are collected for every task and averaged per recursion depth]
Minimal Mode
• Measures the execution time of "real tasks" only
  – Small overhead
  – Incomplete information: tasks removed by the cut-off are not measured
[Figure: serially executed subtrees are not measured, so some recursion depths have no timing data]
Adaptive Profiling
• Collects execution times for each depth of the recursion
• Uses Full Mode until enough information has been collected
• After that, switches to Minimal Mode
[Figure: tasks executed early are profiled in Full Mode; later tasks may go unprofiled in Minimal Mode, but the per-depth table is already filled]
Cut-off strategy
• Estimates a task's execution time from the collected information
  – The average of previous executions at the same depth
• If the estimated execution time is smaller than the threshold, apply the cut-off and execute serially; if it is larger, create a new task and execute in parallel
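The estimation step can be sketched as follows (the threshold value and the recorded timings are made-up numbers for illustration; this mirrors the scheme's logic, not Duran et al.'s actual implementation):

```python
# Record execution times per recursion depth, estimate a new task's
# cost as the average at its depth, and serialize it (cut-off) when
# the estimate falls below THRESHOLD.
THRESHOLD = 0.001          # seconds; illustrative value

profile = {}               # depth -> list of observed execution times

def record(depth, seconds):
    profile.setdefault(depth, []).append(seconds)

def estimate(depth):
    times = profile.get(depth)
    return sum(times) / len(times) if times else None

def should_cut_off(depth):
    est = estimate(depth)
    # with no data (e.g. unprofiled depths in Minimal Mode), keep parallel
    return est is not None and est < THRESHOLD

record(1, 0.05)            # made-up profile data
record(2, 0.004)
record(3, 0.0002)
print(should_cut_off(1))   # False — big tasks stay parallel
print(should_cut_off(3))   # True  — tiny tasks are serialized
```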
Agenda
• Introduction
• Basic Scheduling Methods
• Challenges and solutions
• Consideration
• Summary
Consideration
• When adopting task-scheduling methods for thread scheduling, their side effects must be considered
• The main difference between tasks and threads is fairness
• Fairness: runnable threads receive equal CPU time (weighted by priority)
  – No thread ever keeps the CPU forever
Consideration of fairness
• Affinity Bubble-Scheduler
  – Originally designed for threads
• Split task queues
  – A data structure that reduces locking and improves scalability
  – The basic idea does not impede fairness
• Profile-based cut-off
  – The cut-off may be applied only to short-lived threads
  – Profiling makes it easier to decide where to apply it
Summary
• Basic scheduling methods
• Challenges and solutions
  – Architecture-aware scheduling
    • Affinity Bubble-Scheduler
  – Scalable implementation
    • Split Task Queues
  – Determination of cut-off threshold
    • Profile-based cut-off
• Consideration
  – These solutions are not particularly harmful to fairness
Thanks for your attention!