
Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures

Kallia Chronaki*, Alejandro Rico*, Rosa M. Badia*†, Eduard Ayguadé*‡, Jesús Labarta*‡, Mateo Valero*‡

*Barcelona Supercomputing Center, Barcelona, Spain
†Artificial Intelligence Research Institute (IIIA) - Spanish National Research Council (CSIC), Barcelona, Spain
‡Universitat Politècnica de Catalunya - BarcelonaTech, Barcelona, Spain
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]

ABSTRACT

Current and future parallel programming models need to be portable and efficient when moving to heterogeneous multi-core systems. OmpSs is a task-based programming model with dependency tracking and dynamic scheduling. This paper describes the OmpSs approach to scheduling dependent tasks onto the asymmetric cores of a heterogeneous system. The proposed scheduling policy improves performance by prioritizing newly-created tasks at runtime, detecting the longest path of the dynamic task dependency graph, and assigning critical tasks to fast cores. While previous works use profiling information and are static, this dynamic scheduling approach uses information that is discoverable at runtime, which makes it implementable and functional without the need of an oracle or profiling. The evaluation results show that our proposal outperforms a dynamic implementation of Heterogeneous Earliest Finish Time by up to 1.15×, and the default breadth-first OmpSs scheduler by up to 1.3× on an 8-core heterogeneous platform and up to 2.7× on a simulated 128-core chip.

Categories and Subject Descriptors

C.1.3 [Processor Architectures]: Heterogeneous systems; C.1.4 [Processor Architectures]: Mobile processors; D.1.3 [Programming Techniques]: Parallel programming; D.3.2 [Programming Languages]: Parallel languages

Keywords

High performance computing, Scheduling, Heterogeneous systems, Task-based programming models

1. INTRODUCTION

The future of high-performance computing is highly restricted by energy efficiency [18]. The use of heterogeneous multi-core architectures is an approach towards increasing energy efficiency [13, 14]. These architectures feature different types of processing cores that are designed to target different performance and power optimization points.

Load balancing and scheduling are two of the main challenges in utilizing such heterogeneous platforms. The use of task-based programming models with dynamic scheduling is an answer to tackle these challenges. Some of these programming models allow the specification of inter-task dependencies that enable automatic scheduling and synchronization by the runtime system. OmpSs [10, 4] is an example of this type of programming model. It maintains a dynamic directed acyclic graph of tasks with their current state. Whenever a task's dependencies are satisfied, it becomes ready to be scheduled to an available core.

The mapping of ready tasks to different types of cores on a heterogeneous system becomes a challenge when considering the reduction of the total execution time. Some task-based applications expose a complex dependency graph in which tasks on the critical path determine the total application duration. This opens an opportunity to accelerate the overall application by running critical tasks on fast cores. Some previous works [15, 9, 25, 20] tackled this issue using static scheduling over the whole dependency graph to statically map tasks to processors on a heterogeneous system. However, they required profiling information, and most of them were evaluated on synthetic, randomly-generated task dependency graphs (TDGs).

In this paper, we propose a criticality-aware task scheduler that dynamically assigns critical tasks to fast cores to improve performance on a heterogeneous system with fast and slow cores. Compared to previous proposals, this scheduler is based on information discoverable at runtime, is implementable, and works without the need of an oracle or profiling. Furthermore, our evaluation is based on a real heterogeneous multi-core platform with real applications and, therefore, real TDGs.

The contributions of this paper are the following:

• A novel criticality-aware task scheduler (CATS) that dynamically assigns critical tasks to fast cores in a heterogeneous multi-core. Tasks are defined to be critical if they are part of the longest path in the in-flight dynamic state of the dependency graph. The flexibility and work-stealing policy of our scheduler are configurable. Flexibility increases the number of tasks


considered critical. Work stealing may be uni- or bi-directional: either only fast cores can steal from slow cores, or slow cores can also steal from fast cores.

• An evaluation of our implementation of CATS in the OmpSs programming model compared to a dynamic implementation of Heterogeneous Earliest Finish Time [25] and the default OmpSs scheduler. We evaluate the effectiveness of CATS for different numbers of cores and shares of fast and slow cores on an Odroid-XU3 development board featuring an eight-core Samsung Exynos 5422 chip with ARM big.LITTLE architecture, including four Cortex-A15 and four Cortex-A7 cores. We also evaluate the effectiveness of CATS for different speed ratios between fast and slow cores using simulation of heterogeneous multi-cores with up to 128 cores. The results show that CATS improves overall performance by up to 1.3× on the real eight-core platform, and up to 2.7× on a simulated 128-core system.

2. RELATED WORK

The search for efficient task scheduling on multi-core systems has been studied intensively. Most scheduling heuristics target homogeneous multiprocessors; nevertheless, there is an important body of work on heterogeneous multiprocessors. In this section we give an overview of the different categories of schedulers for heterogeneous systems, discuss schedulers targeting specific systems with compute accelerators, and review previous work on criticality-aware schedulers.

2.1 Schedulers for Heterogeneous Systems

Previous works on schedulers for heterogeneous systems fall into four categories: listing, clustering, guided-random, and duplication-based schedulers.

Listing schedulers [1, 15, 9, 25, 20] have two scheduling stages. In the first stage, each task is given a priority based on the policy defined by each algorithm. In the second stage, tasks are assigned to processors depending on their priorities. Most criticality-aware schedulers fall in this category, and we discuss them in Section 2.3. The scheduler proposed in this paper is also a list scheduler.

Clustering schedulers [26, 27, 15, 17] first separate tasks into clusters, where each cluster is to be executed on the same processor. During the clustering stage, the algorithm assumes an unlimited number of available processors in the system. If the number of clusters exceeds the number of available cores, a merging stage joins multiple clusters so that they match the number of available processors. An example is the Levelized Min Time [17] clustering scheduler. This heuristic clusters tasks that can execute in parallel according to their level (i.e., sibling nodes in a graph have the same level), and assigns priorities to the tasks in a cluster according to their execution time (i.e., tasks with the highest execution time have the highest priority). The task-to-processor assignment is done in decreasing order of priority.

Guided-random schedulers [28, 19, 21] randomize their schedules by applying policies inspired by other sciences. Genetic algorithms [28] group tasks into generations and schedule them according to a randomized genetic technique. Chemical reaction algorithms [19] mimic molecular interactions to map tasks to processors. Some of these guided-random approaches [28, 19] are designed for heterogeneous systems. The scheduler by Page et al. [21] enables dynamic scheduling of multiple-sized tasks for heterogeneous systems, but it does not support dependencies between tasks.

Duplication-based schedulers [5, 29, 2] aim to eliminate communication costs between processors by scheduling tasks and their successors on the same processor. If a task has many successors, it is duplicated and executed on multiple cores prior to its successors, so that all successor tasks get the data from their predecessors with the lowest communication cost. This scheduling potentially introduces redundant duplications of tasks, which may lead to bad schedules. The Heterogeneous Economical Duplication scheduler [2] performs task duplication in an economical manner, as it removes redundant duplicates if they do not affect performance.

These previous works schedule tasks statically and assume prior knowledge of the task execution times on the different processor types of the heterogeneous system.

2.2 Schedulers for Compute Accelerators

The schedulers in the previous section target the scheduling of generic TDGs on generic heterogeneous architectures. In this section we cover schedulers that target specific systems with compute accelerators. These works focus on the scheduling of tasks on the target platform based on the abstractions provided by the corresponding mixture of programming models for the general-purpose processors and the compute accelerators in the system.

Most heterogeneous systems with compute accelerators nowadays combine general-purpose CPUs and GPU compute accelerators. A set of programming models provides abstractions to ease the development of applications on these platforms. OmpSs [10, 4] offers this abstraction by allowing multiple implementations of a given task to be executed on different processing units [23]. The scheduler then assigns the execution of a task to the best resource according to its earliest finish time. Another case is StarPU [3], a library that offers runtime heterogeneity support and provides priority schedulers for task-to-processor allocation. AHP [22] is another framework that generates software pipelines for heterogeneous systems and schedules tasks to their earliest executor, based on profiling information gathered prior to runtime.

None of these works, however, takes into account the criticality of tasks with respect to task dependencies; they rather focus on the earliest execution time of individual tasks on the processor types of the specific system configuration.

2.3 Criticality-Aware Schedulers

Several previous works propose scheduling heuristics that focus on the critical path of a TDG to reduce total execution time [15, 9, 25, 20]. To identify the tasks on the critical path, most of these works use the concepts of upward rank and downward rank. The upward rank of a task is the maximum sum of computation and communication costs of the tasks in the dependency chains from that task to an exit node of the graph. The downward rank of a task is the maximum sum of computation and communication costs of the tasks in the dependency chain from an entry node of the graph up to that task. Each task has an upward rank and downward rank for each processor type in the heterogeneous system, as the computation and communication costs differ across processor types.
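For reference, the upward rank admits a simple recursive formulation. The sketch below is ours, not taken from the cited works, and uses assumed names (Node, computation_cost, comm_cost); real implementations keep one rank per processor type or use costs averaged over processor types:

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical node of a static TDG with known costs; the cited works
// each define their own cost models and data structures.
struct Node {
    double computation_cost;        // average computation cost of the task
    std::vector<Node*> successors;  // outgoing dependency edges
    std::vector<double> comm_cost;  // communication cost of each edge
};

// Upward rank: the task's own cost plus the most expensive dependency
// chain from this task down to an exit node of the graph.
// (Memoization is omitted for brevity.)
double upward_rank(const Node &n) {
    double max_succ = 0.0;
    for (std::size_t i = 0; i < n.successors.size(); ++i)
        max_succ = std::max(max_succ,
                            n.comm_cost[i] + upward_rank(*n.successors[i]));
    return n.computation_cost + max_succ;
}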


The Heterogeneous Earliest Finish Time (HEFT) algorithm [25] maintains a list of tasks sorted in decreasing order of their upward rank. At each schedule step, HEFT assigns the task with the highest upward rank to the processor that finishes the execution of the task at the earliest possible time. Another work is the Longest Dynamic Critical Path (LDCP) algorithm [9]. LDCP also statically schedules the task with the highest upward rank at every schedule step. The difference between LDCP and HEFT is that, once a task is assigned, LDCP replaces the task's computation and communication costs on the multiple processor types with the costs on the processor to which it was assigned.

The Critical-Path-on-a-Processor (CPOP) algorithm [25] also maintains a list of tasks sorted in decreasing order as in HEFT, but in this case ordered according to the sum of upward rank and downward rank. The tasks with the highest upward rank + downward rank belong to the critical path. At each step, these tasks are statically assigned to the processor that minimizes the critical-path execution time.

The main weaknesses of these works are that (a) they assume prior knowledge of the computation and communication costs of each individual task on each processor type, (b) they operate statically on the whole dependency graph, so they do not apply to dynamically scheduled applications in which only a partial representation of the dependency graph is available at a given point in time, and (c) most of them use randomly-generated synthetic dependency graphs that are not necessarily representative of the dependencies in real workloads.

3. OMPSS PROGRAMMING MODEL

OmpSs is a task-based programming model that offers a high-level abstraction for the implementation of parallel applications on various homogeneous and heterogeneous architectures [10, 4]. It enables the annotation of function declarations with the task directive, which declares a task. Every invocation of such a function creates a task that is executed concurrently with other tasks or parallel loops. OmpSs also supports task dependencies, using the StarSs [12] dependency tracking mechanisms. OmpSs is built with the support of the Mercurium compiler, responsible for the translation of the OmpSs annotation clauses to source code, and the Nanos++ runtime system, responsible for the internal creation and execution of the tasks.
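As a minimal illustration of these annotations (the function names and the computation are our own, not from the paper), declaring a task and its data accesses is enough for the runtime to build the dependency graph:

// Declaring the function as a task: every call creates a task. The
// in/inout clauses declare the accessed data regions so the runtime
// can track dependencies between tasks automatically.
#pragma omp task in(src[0;n]) inout(dst[0;n])
void accumulate(const float *src, float *dst, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] += src[i];
}

void example(float *a, float *b, float *c, int n) {
    accumulate(a, b, n);   // task 1: reads a, updates b
    accumulate(b, c, n);   // task 2: depends on task 1 through b
    #pragma omp taskwait   // wait for both tasks to complete
}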

Nanos++ is an environment designed to serve as the runtime platform of OmpSs. It provides device support for heterogeneity and includes different plug-ins for implementations of scheduling policies, throttling policies, thread barriers, dependency tracking mechanisms, work-sharing and instrumentation. This design allows runtime features to be maintained by adding or removing plug-ins, so implementing a new scheduler or supporting a new architecture becomes simple.

The implementations of the different scheduling policies in Nanos++ perform various actions on the states of the tasks. A task is created when a call to it is discovered, but it waits until all its inputs have been produced by previous tasks. When all its input dependencies are satisfied, the task becomes ready. The ready tasks of the application at a given point in time are inserted in the ready queues as dictated by the scheduling policy. Ready queues can be thread-private or shared among multiple threads. When a thread becomes idle, the scheduling policy picks a task from the ready queues for that thread to execute.

The Nanos++ internal data structures support task prioritization. The task priority is an integer field inside the task descriptor that rates the importance of the task. If the scheduling policy supports priorities, the ready queues are implemented as priority queues. In a priority queue, tasks are sorted in decreasing order of their priority. Insertion in a priority queue is always ordered, and removal is always from the head of the queue, i.e., the task with the highest priority. The priority of a task can either be set in user code, using the priority clause, which accepts an integer priority value or expression, or dynamically by the scheduling policy, as described in the next section.
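For example, a task priority could be set in user code as follows (the function name and the priority value are illustrative only):

// Tasks created from this function enter the priority ready queue
// with priority 10, ahead of tasks with lower priority values.
#pragma omp task inout(block[0;n]) priority(10)
void important_step(float *block, int n);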

4. CRITICALITY-AWARE SCHEDULER

The proposed scheduling algorithm generally applies to task-based programming models supporting task dependencies, but for simplicity we explain it in the context of the OmpSs programming model.

The Criticality-Aware Task Scheduler (CATS) uses bottom-level longest-path priorities and consists of three steps:

• Task prioritization: when a task is created and added to the TDG, it is assigned a priority, and the priorities of the rest of the tasks in the graph are updated accordingly.

• Task submission: when a task becomes ready, i.e., all its predecessors have finished their execution, it is submitted to a ready queue. At this point, the algorithm decides whether the task is considered critical or non-critical. The task is then inserted in the corresponding ready queue: tasks in the critical ready queue will be executed by fast cores, and tasks in the non-critical ready queue will be executed by slow cores.

• Task-to-core assignment: when a core becomes idle, it tries to retrieve a task from its corresponding ready queue to execute. If the queue is empty, it may try to steal from the other queue depending on the work-stealing policy. Currently, we support two work-stealing mechanisms: simple work stealing, i.e., fast cores can steal from slow cores; and bi-directional (2DS) work stealing, i.e., both types can steal from the other. The default policy is simple.

These steps are performed dynamically and potentially in parallel on different cores. This means that while some tasks are being prioritized, previously created tasks may be submitted, and others assigned to available cores or executed.

To give an overview of the scheduling process, Figure 1 shows a scheme of the operation of CATS. In the TDG on the left, each node represents a task and each edge represents a dependency between two tasks. The number inside each node is the bottom level of the task: the length of the longest path in the dependency chains from this node to a leaf node. The priority of a task is given by its bottom level. The pattern-filled nodes indicate tasks that are considered critical. The number outside each node is the task id and is used in the text to refer to each task. Critical tasks are inserted in the critical queue, and non-critical tasks in the non-critical queue. The insertion is ordered, with the highest priorities at the head of the queue and the lowest at the tail. Slow cores retrieve tasks from the head of the non-critical queue and fast cores from the head of the critical queue.

Figure 1: Task submission with CATS. Nodes are marked with the bottom level of each task. Pattern-filled nodes mark the critical tasks.

The following sections describe these scheduling steps in more detail.

4.1 Task Prioritization

Each task in the TDG keeps a list of its predecessors (plist). Every time an edge is added to the TDG on the creation of a new task, the corresponding predecessor of the dependency is added to the plist of its successor. For example, in Figure 1, when the dependency between tasks 2 and 5 occurs, task 2 is inserted into the plist of task 5. Thus, the plist of task 5 becomes {2}. Accordingly, the plist of task 10 becomes {2, 9} when the edge 9→10 is inserted into the TDG.
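A minimal sketch of this bookkeeping (assumed names, not the actual Nanos++ structures) could look as follows; prioritize_task is the routine of Listing 1 below:

#include <vector>

// Per-task bookkeeping: the priority (bottom level) and the list of
// predecessors (plist) of the task in the TDG.
struct task {
    int priority = 0;          // bottom level; 0 for a newly created task
    std::vector<task*> plist;  // predecessors of this task
};

void prioritize_task(task *succ);  // Listing 1

// Called when a new dependency edge pred -> succ is added to the TDG.
void add_edge(task *pred, task *succ) {
    succ->plist.push_back(pred);  // record the predecessor
    prioritize_task(succ);        // propagate bottom levels upwards
}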

The priority given to a task is its bottom level. The bottom level is computed by traversing the TDG upwards, starting from the successor that the newly created edge points to. The priority of this successor is 0 because it is a leaf node of the graph, as it is the last created task. Then, using the plist of each task, the algorithm navigates to the upper levels of the TDG and updates the priority of each visited node. This way, not all the graph is updated, but only the tasks that are predecessors on the paths to the new edge. The algorithm also stops going up through a path when it finds a priority larger than the one it would be updated to.

Listing 1 shows the algorithm for task prioritization. This function is called on the creation of a new edge with the successor as argument. The algorithm traverses the plist of the successor task (line 5) and, if the priority of the current predecessor is lower than the bottom level of the successor plus one, it updates the current predecessor's priority to that value (lines 7-8). If the updated predecessor task is ready (i.e., it sits in one of the ready queues), the scheduler reorders the ready queue so it remains ordered considering the updated priority (lines 9-10). Then, the same actions are performed recursively for each predecessor in the plist to update all the possible upward paths from the successor.

There are two conditions that terminate the navigation of the TDG: (a) the plist of the current task (currPred) is empty, meaning we have reached an entry node or the predecessors of the task have finished execution; or (b) the priority of the current task (currPred) remains unchanged, which means that the successor task (succ) does not belong to the longest path because its predecessor already has a higher priority (bottom level).

 1 void prioritize_task(task *succ) {
 2   int blev = succ->priority;
 3   list plist = plistOf(succ);
 4   task *currPred;
 5   while( not isEmpty(plist) ) {
 6     currPred = plist.next();
 7     if(priorityOf(currPred) < blev+1) {
 8       currPred->priority = blev+1;
 9       if(isReady(currPred))
10         readyQueueOf(currPred)->reorder();
11       prioritize_task(currPred);
12     }
13     else return;
14   }
15 }

Listing 1: Pseudo-code for task prioritization with CATS.

4.2 Task Submission

The purpose of this step is to divide the tasks into two groups: critical and non-critical. Critical tasks are tasks that belong to the longest path of the dynamic TDG. The longest path is the path in the TDG with the maximum number of tasks (or nodes); thus, it starts from the task with the maximum bottom level. At runtime, the longest path changes as tasks complete execution and new tasks are created. CATS detects those changes and dynamically decides whether a submitted task belongs to the longest path of the TDG.

When a task's dependencies are satisfied, the task becomes ready for execution and is inserted in the ready queues. Ready queues are priority queues that keep tasks in decreasing order of priority, i.e., the task with the maximum priority resides at the head of the queue. Critical tasks are inserted in the critical queue and non-critical tasks in the non-critical queue. The pattern-filled nodes in Figure 1 represent the critical tasks in that graph.

To determine the criticality of a task, CATS keeps track of the last discovered critical task. Then, for each task that becomes ready, CATS checks the following conditions:

• Whether the priority of the current ready task is higher than the priority of the last discovered critical task (maxPriority).

• Whether the current ready task is the highest-priority immediate successor of the last discovered critical task.

In the first case, the algorithm detects new longest paths that may have been created by the application throughout the execution of a prior longest path. In this case, the scheduler can be either strict or flexible:

• Strict: marks as critical the tasks with priority strictly higher than the priority of the last critical task.

• Flexible: marks as critical the tasks with priority higher than or equal to the priority of the last critical task.

As a result, the flexible scheduler ends up with more critical tasks than the strict one. The flexibility of the scheduler can be set by the programmer through an environment variable.

A task that satisfies the second condition has a lower priority than the maximum, but it belongs to the longest path because it is the highest-priority immediate successor of the last detected critical task.

Listing 2 shows a simplified version of the task submission code. The variable maxPriority (line 1) stores the priority of the last critical task, and maxPriorityTask (line 2) stores the last critical task itself.


 1 int maxPriority = 1;
 2 task *maxPriorityTask = NULL;
 3
 4 void submit_task(task *t) {
 5   if( t->priority >= maxPriority or
 6       (t->priority == maxPriority - 1 and
 7        t ∈ succListOf(maxPriorityTask)) )
 8   { //the task is critical
 9     critical_queue.push(t);
10     maxPriority = priorityOf(t);
11     maxPriorityTask = t;
12     return;
13   }
14   //the task is non-critical
15   non_critical_queue.push(t);
16 }

Listing 2: Pseudo-code for task submission with CATS.

Initially, maxPriority is set to 1 and maxPriorityTask is set to NULL. This avoids scheduling independent tasks (i.e., tasks with zero priority) to fast processors at the start of the execution. When the first ready task arrives, if its priority is higher than 1 (or equal, in flexible mode) (line 5), it is considered the first task of the longest path. It is therefore inserted in the critical queue, and the variables maxPriority and maxPriorityTask are updated accordingly (lines 9-11) to correctly determine the criticality of the next submitted task.

If the priority of the submitted task is equal to maxPriority - 1, we check whether it also belongs to the successors of the task with the maximum priority (lines 6-7), and therefore to the longest path. If both conditions are met, the task is determined to be critical: it is inserted in the critical queue and, as before, the variables maxPriority and maxPriorityTask are updated (lines 9-11). In all other cases the task is not considered critical and is inserted in the non-critical queue.

Figure 2 shows an example of a TDG during task submission. The gray nodes in the graph are tasks that have finished execution and the pattern-filled nodes are critical tasks. The numbers inside the nodes indicate their priority and the numbers outside the nodes show the task id, which is assigned in task creation order. The variable maxPriority holds the priority of the last critical task, and maxTaskSucc is the list of the successors of the last critical task, filled with the task ids of the successors. Initially, maxPriority is set to 1 and maxTaskSucc is empty. When task 2 is about to be submitted, it is inserted in the critical queue because its priority is higher than the maximum, which at the beginning is 1. Then, the value of maxPriority is set to 6 (the priority of task 2), and the maxTaskSucc list is updated with the successors of task 2. At the point where all the gray tasks have finished execution, the values of maxPriority and maxTaskSucc are as shown in Figure 2. For every newly-ready task, the conditions listed before are evaluated. When task 7 is submitted, it is not considered critical because it does not belong to the maxTaskSucc list and its priority is not equal to maxPriority - 1. Contrarily, task 8 satisfies both conditions and is therefore inserted in the critical queue.

4.3 Task-to-Core Assignment

Task-to-core assignment takes place dynamically and in parallel with the previous steps.

Figure 2: Task submission. Gray nodes indicate finished tasks and pattern-filled nodes indicate critical tasks. The annotations track the evolution of the scheduler state: maxPriority = 1, maxTaskSucc = { }; then maxPriority = 6, maxTaskSucc = {4, 5, 6, 10}; then maxPriority = 5, maxTaskSucc = {8}. For task 8, both conditions hold: (4 == maxPriority - 1) and (8 ∈ maxTaskSucc).

When a core becomes idle, it checks the ready queue corresponding to its core type to get a task to execute. Fast cores retrieve critical tasks from the critical queue, while slow cores retrieve non-critical tasks from the non-critical queue. Each ready queue is shared among the cores of the same type, so there is no need for work stealing among cores of the same type.

If the tasks of an application are imbalanced, i.e., the majority are non-critical and only a few tasks are critical, or vice versa, one of the processor types would be overloaded while the other starves for work. This can happen in applications with wide graphs and a large number of tasks, where the ratio between critical tasks and the total number of tasks may be small. To leverage all the resources, the default work-stealing mechanism of CATS lets fast cores steal work from slow cores whenever the critical queue becomes empty. CATS can also be configured to perform bi-directional work stealing, so that slow processors can steal tasks from the critical queue if the non-critical queue becomes empty. We evaluate these different options and show the results in the next section.
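The assignment step with both stealing policies can be sketched as follows (the queue names and the pop interface are our assumptions; pop is taken to return the highest-priority task, or a null pointer when the queue is empty):

struct task;  // task descriptor (Section 4.1)

// Shared priority ready queues, one per core type (Section 4.2).
struct ready_queue { task *pop(); };
extern ready_queue critical_queue, non_critical_queue;

// Pick the next task for an idle core under the configured policy.
task *next_task(bool fast_core, bool bidirectional) {
    // First try the ready queue that matches the core type.
    task *t = fast_core ? critical_queue.pop() : non_critical_queue.pop();
    if (t) return t;
    // Simple stealing: fast cores may take non-critical work.
    if (fast_core) return non_critical_queue.pop();
    // 2DS stealing: slow cores may also take critical work.
    if (bidirectional) return critical_queue.pop();
    return nullptr;  // stay idle until new tasks become ready
}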

5. EVALUATION

5.1 Methodology

We measure the execution time of four criticality-sensitive applications using CATS, the default OmpSs scheduler, and the Heterogeneous Earliest Finish Time scheduler (HEFT) [25] implemented in OmpSs. We evaluate four different CATS configurations based on the options explained in Section 4. The options consist of whether the task submission policy is strict or flexible, and the type of work-stealing mechanism. This results in the following configurations:

• Flexible with simple work stealing (SS FLEX)

• Flexible with bidirectional work stealing (2DS FLEX)

• Strict with simple work stealing (SS STRICT)

• Strict with bidirectional work stealing (2DS STRICT)

The default OmpSs scheduler employs a breadth-first policy (BF) [11]. The BF scheduler implements a single first-in-first-out ready queue shared among all threads. When a task is ready, it is inserted at the tail of the ready queue, and when a core becomes available, it retrieves a task from the head of the queue. Tasks are ordered according to their ready time: the earliest-ready task resides at the head of the queue. Since the ready queue is shared, there is no need for work stealing, and the load is balanced automatically.


Table 1: Evaluated benchmarks and relevant characteristics

Application            | Problem size                    | #Tasks | #Task types | Avg. task exec. time (µs) | Measured perf. ratio | Critical tasks %
-----------------------|---------------------------------|--------|-------------|---------------------------|----------------------|-----------------
Cholesky factorization | 8×8 blocks of 1024×1024 floats  | 120    | 4           | 1 009 786                 | 2.23                 | 17.50
                       | 16×16 blocks of 512×512 floats  | 816    |             | 147 812                   |                      | 5.51
                       | 32×32 blocks of 512×512 floats  | 5 984  |             | 147 812                   |                      | 2.60
QR factorization       | 16×16 blocks of 512×512 doubles | 1 496  | 4           | 1 225 254                 | 4.26                 | 4.47
Heat diffusion         | 16×16 blocks of 512×512 doubles | 5 124  | 3           | 88 689                    | 2.83                 | 4.13
Integral Histogram     | 8×8 blocks of 512×512 floats    | 2 048  | 2           | 83 055                    | 1.71                 | 7.32

BF does not differentiate among core types and assigns tasks on a first-come-first-served basis.

We implemented a dynamic version of the HEFT scheduler (dHEFT) in the OmpSs programming model. The implementation assumes two different types of cores (big and little) and keeps records of the tasks' execution times on each core. The original HEFT [25] implementation assumes prior knowledge of the TDG as well as the costs of the tasks. In our case, since the evaluation consists of running real applications, the best way to compare HEFT to our proposal is to keep the scheduling idea of HEFT and transform it from a static into a dynamic scheduler. This means that dHEFT discovers the costs of the tasks at runtime, computes a mean value of the costs for each task-type and parameter-size pair on each type of core, and then finds the core that will finish the task at the earliest possible time. A difference between the static HEFT and dHEFT is the order in which tasks are submitted. In HEFT, tasks are submitted in decreasing order of their upward rank, which is computed statically with the known task costs. Since dHEFT discovers the costs as tasks execute, there is no priority in task submission: tasks are submitted as soon as they become ready.

To find the earliest possible executor, dHEFT maintains one list per core (wlist) including the ready tasks waiting to be executed by that core. When another task becomes ready, dHEFT first checks if there are records of prior executions of this task. If the number of records is sufficient (in our experiments we require a minimum of three records), the estimated execution time of the task is considered stable. Then, using that estimated execution time, the task is scheduled to the earliest executor by consulting the wlists of all the cores. If the number of records is not sufficient for one of the core types, the task is scheduled to the earliest executor of that core type to obtain another record of that task-type and core-type execution time. In all cases, dHEFT updates the history of records on every task execution to adapt to phase changes in the application.
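The core of this earliest-executor search can be sketched as follows (the structures and names are our assumptions, not the actual implementation); each core's ready_at summarizes the work pending in its wlist:

#include <vector>

// Per-core state: when the core's wlist is expected to drain, and its type.
struct core {
    double ready_at;  // estimated time at which the core becomes free
    bool   is_big;
};

// Estimated execution time of the task on a core type, from the history
// of records per task type and parameter size (assumed to be stable).
double estimated_cost(bool is_big);

// Return the core with the minimum estimated finish time for the task.
core *earliest_executor(std::vector<core> &cores) {
    core *best = nullptr;
    double best_finish = 0.0;
    for (core &c : cores) {
        double finish = c.ready_at + estimated_cost(c.is_big);
        if (best == nullptr || finish < best_finish) {
            best = &c;
            best_finish = finish;
        }
    }
    return best;
}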

Our test bed comprises a big.LITTLE processor and a simulated heterogeneous system:

5.1.1 ODROID-XU3

The Hardkernel ODROID-XU3 development board has an 8-core Samsung Exynos 5422 chip with an ARM big.LITTLE architecture and 2GB of LPDDR3 RAM at 933MHz. The chip has four Cortex-A15 cores at 2.0GHz and four Cortex-A7 cores at 1.4GHz. The four Cortex-A15 cores form a cluster with a shared 2MB L2 cache, and the Cortex-A7 cores share a 512KB L2 cache. The two clusters are coherent, so a single shared-memory application can run on both clusters, using up to eight cores simultaneously.

In our experiments, we evaluate a set of possible combinations of fast and slow cores, varying the total number of cores from two to eight. For the remainder of the paper, we refer to Cortex-A15 cores as big and to Cortex-A7 cores as little.

5.1.2 Simulation

To evaluate CATS on larger multi-core systems we use the heterogeneous multi-core TaskSim simulator [24]. TaskSim allows the specification of a heterogeneous system with two different types of cores: fast and slow. We can configure the number of cores of each type and the difference in performance between the core types (the performance ratio) in the TaskSim configuration file. In our experiments, we evaluate the effectiveness of CATS on a total of 80 distinct heterogeneous machine configurations. These comprise systems with a total number of cores ranging from 16 to 128, and a number of fast cores ranging from 1 to 16. For all these configurations, we evaluate the following performance ratios between fast and slow cores: 2×, 2.5×, 3×, 3.5× and 4×.

For both real and simulated platforms, each set-up has a given number of total and big cores. Our metrics are the improvement of CATS over the baseline scheduler, and the speedup over the execution on one little core. Equations 1 and 2 show the improvement and speedup calculations.

Improv. over BF(total, big) = Exec. time BF(total, big) / Exec. time CATS(total, big)    (1)

Speedup(total, big) = Exec. time(1, 0) / Exec. time(total, big)    (2)

The execution time in these calculations is the average of 10 executions of the application on each machine set-up.

5.2 Applications

We use four scientific kernels implemented in the OmpSs programming model: Cholesky factorization, QR factorization, Heat diffusion and Integral Histogram. These benchmarks are available in the BSC Application Repository [6]. Their configurations and characteristics are shown in Table 1.

The applications have different sensitivities depending on the inter-task dependencies, which determine the available parallelism, and on the percentage of critical tasks, shown in the table. The larger the proportion of critical tasks, the larger the potential improvement of CATS, as these are the tasks that, once accelerated, reduce the application's overall execution time. The percentage of critical tasks in Table 1 is from a single-core execution of CATS. Since CATS is dynamic, the tasks considered critical would differ across configurations, since the algorithm runs on a partial dependency graph that includes only the outstanding tasks.

Figure 3: Cholesky factorization task dependency graphs: (a) 8×8 blocks; (b) 16×16 blocks.

However, the single-core figures serve as an estimate of the percentage of critical tasks intrinsic to the overall application, which allows us to reason about the results with respect to this application characteristic.

The performance ratio between big and little cores depends on the application. For example, the difference between the issue rate and throughput of the double-precision floating-point units of the two core types is larger than the difference for single-precision floating-point instructions. Therefore, applications dominated by double-precision operations benefit more from running on the big cores than single-precision dominated applications, as shown in Table 1.

The performance ratios in Table 1 are computed for the real machine by comparing the execution time on one little core with the execution time on one big core. Using these performance ratios, we can estimate the ideal speedup over a little core for each application running on all eight cores, given by Equation 3. The ideal speedup assumes a fully parallel workload, without any dependencies, overheads or sequential sections, and is thus unachievable by the dependency-intensive applications in our evaluation.

ideal speedup(workload, 8) = 4 × perf ratio(workload) + 4    (3)

The average overhead per task of each scheduler is negligible compared to the average task execution times shown in Table 1. Specifically, the per-task overhead is in the order of tens of microseconds for BF and in the order of hundreds of microseconds for dHEFT and CATS when running on 8 cores. The most time-consuming phase of CATS is task prioritization, in which the TDG is traversed to update the bottom levels of the tasks. For dHEFT, the search for the appropriate worker for each task is the main cost. Normally, these overheads in heterogeneous schedulers are paid off by the more effective task execution.

5.2.1 Cholesky Factorization

Cholesky factorization is a dense matrix operation that is used for the efficient solution of linear equations in linear least-squares systems. The OmpSs implementation blocks the input matrix into square blocks of floats and then performs the factorization on each block using tasks that call functions of the Intel MKL library [16].

Figure 3 shows the TDGs for input sizes of (a) 8×8 and (b) 16×16 blocks. Critical tasks are denoted as red nodes. The TDG becomes wider as the number of blocks increases. This reduces the percentage of critical tasks, as shown in Table 1. The 8×8 blocks case shows a narrower TDG, which makes the application more criticality-sensitive than the 16×16 blocks case that exposes more parallelism. We evaluate both configurations to show the impact of scheduling for different criticality sensitivities of the application configuration.

5.2.2 QR Factorization

QR factorization is a linear algebra algorithm that is often used to solve the linear least-squares problem [8]. We evaluate the performance of a blocked, communication-avoiding QR factorization implementation in OmpSs. We use an input matrix of 8192×8192 doubles and a block size of 512×512, which creates 16×16 blocks.

5.2.3 Heat Diffusion

The heat diffusion benchmark uses the Gauss-Seidel method to compute the heat distribution on a matrix A from heat sources represented by x. Heat diffusion implements an iterative solver that invokes the Gauss-Seidel method until the desired convergence is reached. In this implementation [6], the number of iterations can be set as an argument in order to control the workload of the experiments. The code splits the matrix into blocks and creates one task per block for the Gauss-Seidel computation. We use a matrix of 8192×8192 doubles and a block size of 512×512, and set the number of iterations to 20.

5.2.4 Integral Histogram

The integral histogram is a method to compute a cumulative histogram for each pixel of an image [7]. The OmpSs implementation performs a blocked cross-weave scan for each block of the image. The horizontal scan processes one image block at a time and transmits the histograms to the block on the right. The vertical scan processes one block and transmits the histograms to the block below the current block. Due to these histogram transmissions, the application introduces many task dependencies. We evaluate Integral Histogram with an input image of 4096×4096 pixels and a block size of 512×512, resulting in an 8×8 blocked matrix.

5.3 Real Environment Evaluation

Figure 4 shows the results of the experiments on the ODROID-XU3. Figure 4a shows the speedup of CATS (SS FLEX), dHEFT and BF when running the applications on all eight cores, as well as the estimated ideal speedup of the platform for each application. Cholesky and Integral Histogram operate on single-precision data, while QR and Heat Diffusion operate on double-precision data. Double-precision applications get larger speedups over one little core because they benefit from a larger performance ratio when running on a big core. In all cases, CATS scales better than dHEFT and BF. The shortening of the critical path by running all critical tasks on big cores effectively shortens the total execution time when running on all cores.

Figure 4b shows the improvement of the CATS configurations and dHEFT over BF, and the speedup obtained with CATS, dHEFT and BF, for Cholesky on an 8×8 blocked matrix. CATS consistently achieves better performance than dHEFT and BF, and the improvement over BF increases with the number of cores. Specifically, the improvement reaches up to 30% when running on seven and eight cores.

Figure 4c shows the performance on a 16×16 block input matrix, where the improvement of CATS is smaller, ranging from 2 to 9%, and all schedulers perform fairly well.

Figure 4: Schedulers comparison on ODROID-XU3: (a) speedup of CATS, dHEFT and BF on 8 cores compared to the ideal; (b) Cholesky 8×8; (c) Cholesky 16×16; (d) QR; (e) Heat Diffusion; (f) Integral Histogram. Panels (b)-(f) plot the improvement over BF (bars) and the speedup over one little core (lines) for SS FLEX, SS STRICT, 2DS FLEX, 2DS STRICT, dHEFT and BF, for varying numbers of total and big cores.

In this case the opportunities for improvement are limited since, according to Figure 4a, BF performance approaches the ideal speedup. The lower improvement in this case comes from the fact that the application is less sensitive to the critical path: the task graph is wider (as shown in Figure 3) and, accordingly, the percentage of critical tasks is lower. However, CATS still outperforms BF by 7% when using the eight cores of the system and performs as well as dHEFT.

Figure 4d shows the improvement of CATS and dHEFT over BF and their speedup for QR factorization. QR consists of double-precision operations, which make the big cores 4.26× faster than the little cores. Thus, the ideal speedup of the system for QR is 20.1×. CATS achieves a 15× speedup by shortening the execution of the dynamic longest path of the TDG. Since dHEFT does not focus on the longest path of the TDG, and performs some random scheduling at the beginning of the execution in order to determine the task costs, it is less effective than CATS but still better than BF as the number of cores increases.

Figure 4e shows the improvement of the different CATS configurations and dHEFT over BF, and the speedup obtained, for heat diffusion. As shown in Table 1, heat diffusion consists of 5 124 tasks, 5 120 of which are of the same type and parameter size. This limits the benefit of dHEFT, which tries to find the earliest executor for each task, over BF, in which whenever a core becomes available it retrieves a task from the ready queue. dHEFT is still better than BF because it uses private ready queues for each core, which reduces the contention of using only one ready queue. CATS consistently improves the scheduling on the asymmetric system by 15% to 22%, since its main scheduling criterion is the TDG structure. CATS achieves a 13× speedup when using all eight cores, getting very close to the ideal 15.3× shown in Figure 4a.

Figure 4f shows the improvement of CATS and dHEFT over BF, and the speedup obtained, for integral histogram. The impact of CATS is again positive for all configurations: the improvement is at least 5%, with the peak being 15% on 8 cores.


Figure 5: Cholesky 32×32 SS FLEX simulation; perf x is the performance ratio between fast and slow cores. The plot shows the improvement over BF for varying numbers of total and big cores.

Experiments on larger systems show that this application's scalability saturates beyond 16 cores. The intensive dependencies between tasks reduce the available parallelism, which is mainly present among the diagonally-placed blocks due to the cross-weave scanning process. Since dHEFT does not take the dependencies between tasks into account, scheduling an application whose parallelism is limited by dependencies becomes harder for it. Moreover, the performance ratio of this single-precision application is 1.7×, so the ideal speedup is limited to 10.8×.
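For reference, the ideal-speedup figures quoted in this section are consistent, up to rounding, with a simple model in which each fast core counts as $R$ slow cores, where $R$ is the measured big-to-little performance ratio:

$$ S_{\text{ideal}} \;=\; n_{\text{little}} + n_{\text{big}} \cdot R, \qquad \text{e.g., } 4 + 4 \times 1.7 = 10.8 \text{ for integral histogram on the 4+4 platform.} $$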

An interesting observation is that for single-precision kernels the improvement of CATS is proportional to the percentage of critical tasks. Integral histogram achieves greater improvements than Cholesky 16×16 because, according to the percentage of critical tasks in Table 1, it immediately schedules more tasks onto the fast cores of the system. On the other hand, QR and heat diffusion show the opposite effect: QR has the largest performance ratio and a larger percentage of critical tasks than heat diffusion, yet heat diffusion shows a higher overall improvement across the different configurations. We attribute this to the greater sensitivity of heat diffusion to the critical path, which allows CATS to achieve a large improvement even for a configuration of one big and one little core. The dHEFT scheduler outperforms BF but struggles to handle applications with complex TDGs. Applications consisting of multiple instances of the same task type benefit less from dHEFT than applications that process tasks with variable costs, such as Cholesky and QR.

In all cases, the SS FLEX configuration of CATS achieves the best performance, since it produces enough critical tasks for the big cores, which shortens the longest path of the TDG. The SS STRICT policy produces fewer critical tasks, causing a slight imbalance that is corrected by the work-stealing mechanism, but with lower effectiveness. The 2DS FLEX configuration produces the same number of critical tasks as SS FLEX, but its bi-directional work stealing allows little cores to steal critical tasks, which lengthens the execution of the critical path and directly increases the overall execution time.
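As a rough illustration of the mechanisms compared above, the sketch below assumes the two-queue organisation implied by this discussion (critical tasks queued for big cores, non-critical tasks for little cores), a looser (FLEX) versus stricter (STRICT) criticality test, and single-direction (SS) versus bi-directional (2DS) stealing. The exact tests and data structures in the runtime may differ; everything named here is illustrative.

```python
from collections import deque

# Two ready queues, as assumed in this sketch: big cores serve the
# critical queue, little cores the non-critical one.
critical_q: deque = deque()
non_critical_q: deque = deque()

def mark_critical(blevel: int, longest_blevel: int, policy: str) -> bool:
    # FLEX also counts ties as critical, so it marks more tasks than
    # STRICT -- matching the behaviour reported above. The runtime's
    # actual criticality test may differ; this is illustrative.
    if policy == "FLEX":
        return blevel >= longest_blevel
    return blevel > longest_blevel          # STRICT

def next_task(core_is_big: bool, stealing: str):
    # An idle core first tries its own queue, then (maybe) steals.
    own, other = ((critical_q, non_critical_q) if core_is_big
                  else (non_critical_q, critical_q))
    if own:
        return own.popleft()
    # SS: only big cores steal (from the non-critical queue).
    # 2DS: little cores may also steal critical tasks.
    if core_is_big or stealing == "2DS":
        return other.popleft() if other else None
    return None  # little core stays idle under SS
```

Under this model, 2DS keeps idle little cores busy but can put a critical task on a slow core, which matches the lengthened critical path reported for 2DS FLEX above.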

5.4 Simulations

To estimate the impact of CATS on larger systems, we run three of the benchmarks using the TaskSim simulator [24]. Integral histogram is excluded from the simulated evaluation because it does not scale beyond 16 cores.

[Figure 6 plot: improvement over BF vs. number of big cores for total core counts of 16, 32, 64, and 128; one curve per performance ratio (perf 2 to perf 4).]

Figure 6: QR SS FLEX simulation; perf x is the performance ratio between fast and slow cores

[Figure 7 plot: improvement over BF vs. number of big cores for total core counts of 16, 32, 64, and 128; one curve per performance ratio (perf 2 to perf 4).]

Figure 7: Heat diffusion SS FLEX simulation; perf x is the performance ratio between fast and slow cores

Also, dHEFT is not included in the simulation experiments because the simulation platform does not support reporting task execution costs to the scheduler.

Simulation allows us to evaluate larger systems and multiple performance ratios. The results include a fixed scheduling overhead for all configurations, regardless of the dynamic overheads during execution (e.g., work stealing). The error introduced by this assumption is small because, as stated above, the average scheduling overheads in our applications are negligible compared to the average task execution time. For space reasons, we only show simulation results for the best CATS configuration: SS FLEX.

Figures 5, 6 and 7 show the improvement of CATS over BF on systems with multiple performance ratios (perf x) for Cholesky, QR and heat diffusion, respectively. The impact of CATS is larger on systems with a higher performance ratio between fast and slow cores. This makes sense, as a faster execution of the critical path directly shortens the overall execution time. In addition, CATS utilises fast cores more effectively than BF, which results in larger improvements with larger numbers of fast cores. In Figure 5, the improvement for 16 cores is comparatively small. The problem size used in this experiment is a 16384×16384 matrix of floats with 512×512 blocks, which creates a 32×32 blocked matrix; we use this input because the other Cholesky configurations do not scale to 128 cores. The small number of critical tasks in the 32×32 input, as shown in Table 1, makes the workload less sensitive to critical tasks and limits the improvement of CATS for Cholesky to a maximum of 1.4×, while QR and heat diffusion improve by factors of 1.6× and 2.7×, respectively. Again, heat diffusion gets the largest improvement due to its greater sensitivity to criticality, as was also shown in the real-platform evaluation.


6. CONCLUSION

We introduced the first criticality-aware dynamic scheduling policy for heterogeneous environments. Contrary to previous works on criticality-aware scheduling, which use synthetic task graphs and require profiling information in advance, our proposal works on real platforms with real applications, is implementable, and functions without the need for an oracle or profiling.

We implemented and evaluated our criticality-aware task scheduler in the runtime system of the OmpSs programming model, obtaining satisfactory results. The implementation shown in this paper will be included in the next stable release of OmpSs. Furthermore, there are no restrictions on applying our policy to other task-based programming models with support for task dependencies.

From our experiments on a real heterogeneous multi-core platform, we found a consistent performance improvement over the default breadth-first scheduling policy and over a dynamic implementation of Heterogeneous Earliest Finish Time. The improvement of our proposal, which in most cases ranges from 10% to 20% and reaches up to 30%, grows as we increase the number of cores. This gives a positive projection for CATS, as the number of cores in multi-cores is expected to keep increasing in future generations of designs.

From our simulation experiments, we found that the improvement of CATS over the baseline increases with larger performance differences between fast and slow cores. We explored performance ratios with fast cores between two and four times faster than slow cores, with improvements ranging from 30% to 170%.

In conclusion, this paper shows the potential of a criticality-aware scheduler to speed up dependency-intensive applications and take advantage of asymmetric compute resources. For future work, we aim to make the scheduling policy adaptive, so that it can dynamically adjust its flexibility and work-stealing policy depending on the application characteristics and the availability of resources at runtime.

Acknowledgements

The authors would like to thank Paul Carpenter and Alex Ramirez for their help in improving the quality of this paper. This work has been supported by grant SEV-2011-00067 of the Severo Ochoa Program, awarded by the Spanish Government, by the Spanish Ministry of Science and Innovation (contracts TIN2012-34557 and CAC2007-00052), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the Mont-Blanc 2 project (FP7-610402).

7. REFERENCES

[1] T. L. Adam, K. M. Chandy, and J. R. Dickson. A Comparison of List Schedules for Parallel Processing Systems. Commun. ACM, 17(12), 1974.

[2] A. Agarwal and P. Kumar. Economical Duplication Based Task Scheduling for Heterogeneous and Homogeneous Computing Systems. IACC 2009, 2009.

[3] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier. StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures. Concurr. Comput.: Pract. Exper., 23(2), 2011.

[4] E. Ayguade, R. Badia, P. Bellens, D. Cabrera, A. Duran, R. Ferrer, M. Gonzalez, F. Igual, D. Jimenez-Gonzalez, J. Labarta, L. Martinell, X. Martorell, R. Mayo, J. Perez, J. Planas, and E. Quintana-Ortí. Extending OpenMP to Survive the Heterogeneous Multi-Core Era. International Journal of Parallel Programming, 38(5-6), 2010.

[5] S. Bansal, P. Kumar, and K. Singh. An Improved Duplication Strategy for Scheduling Precedence Constrained Graphs in Multiprocessor Systems. Parallel and Distributed Systems, IEEE Transactions on, 14(6), 2003.

[6] Barcelona Supercomputing Center. Barcelona Application Repository. Available online on April 18th, 2014.

[7] P. Bellens, K. Palaniappan, R. Badia, G. Seetharaman, and J. Labarta. Parallel Implementation of the Integral Histogram. In Advanced Concepts for Intelligent Vision Systems, volume 6915 of Lecture Notes in Computer Science. 2011.

[8] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. Parallel Tiled QR Factorization for Multicore Architectures. Technical report, 2007.

[9] M. Daoud and N. Kharma. Efficient Compile-Time Task Scheduling for Heterogeneous Distributed Computing Systems. ICPADS 2006, 2006.

[10] A. Duran, E. Ayguade, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas. OmpSs: a Proposal for Programming Heterogeneous Multi-Core Architectures. Parallel Processing Letters, 21, 2011.

[11] A. Duran, J. Corbalan, and E. Ayguade. Evaluation of OpenMP Task Scheduling Strategies. IWOMP'08, 2008.

[12] A. Duran, J. M. Perez, E. Ayguade, R. M. Badia, and J. Labarta. Extending the OpenMP Tasking Model to Allow Dependent Tasks. IWOMP'08, 2008.

[13] A. Fedorova, J. C. Saez, D. Shelepov, and M. Prieto. Maximizing Power Efficiency with Asymmetric Multicore Systems. Communications of the ACM, 52(12), 2009.

[14] P. Greenhalgh. big.LITTLE Processing with ARM Cortex-A15 & Cortex-A7. ARM White Paper, 2011.

[15] M. Hakem and F. Butelle. Dynamic Critical Path Scheduling Parallel Programs onto Multiprocessors. IPDPS'05, 2005.

[16] Intel Corporation. Reference Manual for Intel Math Kernel Library 11.1.

[17] M. A. Iverson, F. Ozguner, and G. J. Follen. Parallelizing Existing Applications in a Distributed Heterogeneous Environment. HCW'95, 1995.

[18] P. Kogge, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, K. Hill, and others. Exascale Computing Study: Technology Challenges in Achieving Exascale Systems. Technical report, University of Notre Dame, CSE Dept., 2008.

[19] K. Li, Z. Zhang, Y. Xu, B. Gao, and L. He. Chemical Reaction Optimization for Heterogeneous Computing Environments. ISPA, 2012.

[20] C.-H. Liu, C.-F. Li, K.-C. Lai, and C.-C. Wu. A Dynamic Critical Path Duplication Task Scheduling Algorithm for Distributed Heterogeneous Computing Systems. Volume 1 of ICPADS 2006, 2006.

[21] A. Page and T. Naughton. Dynamic Task Scheduling using Genetic Algorithms for Heterogeneous Distributed Computing. In Parallel and Distributed Processing Symposium, 2005. Proceedings. 19th IEEE International, 2005.

[22] J. A. Pienaar, S. Chakradhar, and A. Raghunathan. Automatic Generation of Software Pipelines for Heterogeneous Parallel Systems. SC '12, 2012.

[23] J. Planas, R. Badia, E. Ayguade, and J. Labarta. Self-Adaptive OmpSs Tasks in Heterogeneous Environments. IPDPS, 2013.

[24] A. Rico, F. Cabarcas, C. Villavieja, M. Pavlovic, A. Vega, Y. Etsion, A. Ramirez, and M. Valero. On the Simulation of Large-Scale Architectures Using Multiple Application Abstraction Levels. ACM Trans. Archit. Code Optim., 8(4), 2012.

[25] H. Topcuoglu, S. Hariri, and M.-Y. Wu. Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems, 13(3), 2002.

[26] M.-Y. Wu and D. Gajski. Hypertool: a Programming Aid for Message-Passing Systems. Parallel and Distributed Systems, IEEE Transactions on, 1(3), 1990.

[27] T. Yang and A. Gerasoulis. DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors. Parallel and Distributed Systems, IEEE Transactions on, 5(9), 1994.

[28] H. Yu. A Hybrid GA-based Scheduling Algorithm for Heterogeneous Computing Environments. SCIS'07, 2007.

[29] Z. Zong, A. Manzanares, X. Ruan, and X. Qin. EAD and PEBD: Two Energy-Aware Duplication Scheduling Algorithms for Parallel Tasks on Homogeneous Clusters. Computers, IEEE Transactions on, 60(3), 2011.