1
An Effective Dynamic Scheduling Runtime and Tuning
System for Heterogeneous Multi- and Many-Core Desktop
Platforms
Authors: Alécio P. D. Binotto, Carlos E. Pereira, Arjan Kuijper, André Stork, and Dieter W. Fellner
ytchen, 2012.09.19
2
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
3
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
4
Introduction
• High-performance platforms are commonly required for scientific and engineering algorithms that must deal appropriately with timing constraints.
• Both computation time and overall performance need to be optimized.
• Efficiency is important both for huge domain sizes and for small problems.
5
Introduction
• Our dynamic scheduling method combines a first assignment phase for a set of high-level tasks (e.g., algorithms), based on a pre-processing benchmark that acquires basic performance samples of the tasks on the PUs, with a runtime phase that obtains real performance measurements of the tasks and feeds a performance database.
6
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
7
Motivation
• 3D Computational Fluid Dynamics (CFD)
• large computations
o velocity field
o local pressure
• Examples
o planes
o cars
8
Motivation
• three iterative solvers for SLEs (systems of linear equations): Jacobi, Red-Black Gauss-Seidel, and Conjugate Gradient
o Jacobi: an iterative method for determining the solution of a system of linear equations whose matrix is diagonally dominant, i.e., in each row the absolute value of the diagonal element dominates the remaining entries.
o Red-Black Gauss-Seidel: an iterative method used to solve a linear system of equations resulting from the finite-difference discretization of partial differential equations.
o Conjugate Gradient: an algorithm for the numerical solution of particular systems of linear equations, namely those whose matrix is symmetric and positive-definite.
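For reference, the Jacobi method can be sketched in a few lines of NumPy; this is a minimal illustrative version (the paper's implementations are per-PU OpenCL kernels, not this Python code), assuming a diagonally dominant matrix:

```python
import numpy as np

def jacobi(A, b, tol=1e-8, max_iter=10_000):
    """Solve A x = b iteratively; converges when A is strictly
    diagonally dominant (assumed, not checked here)."""
    n = len(b)
    x = np.zeros(n)
    D = np.diag(A)                 # diagonal elements a_ii
    R = A - np.diagflat(D)         # off-diagonal part of A
    for _ in range(max_iter):
        # x_i = (b_i - sum_{j != i} a_ij * x_j) / a_ii
        x_new = (b - R @ x) / D
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            return x_new
        x = x_new
    return x

# Example: a small diagonally dominant system
A = np.array([[4.0, 1.0], [2.0, 3.0]])
b = np.array([1.0, 2.0])
x = jacobi(A, b)
```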
9
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
10
System overview
• A Unit of Allocation (UA) is represented as a task.
11
Platform-Independent Programming Model
• OpenCL
• In principle, the API encapsulates implementations of a task (methods, algorithms, parts of code, etc.) for different PUs, leveraging intrinsic hardware features while keeping the tasks platform independent.
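The encapsulation idea can be sketched as follows; this is a hypothetical Python illustration (names like `Task` and `register` are made up, and plain callables stand in for the OpenCL kernels the paper uses):

```python
# One high-level task exposes several PU-specific implementations
# behind a single, platform-independent interface.
class Task:
    def __init__(self, name):
        self.name = name
        self.impls = {}                 # PU name -> implementation

    def register(self, pu, fn):
        self.impls[pu] = fn             # e.g. an OpenCL kernel launcher

    def run(self, pu, *args):
        return self.impls[pu](*args)    # dispatch to the chosen PU

jacobi_task = Task("jacobi")
jacobi_task.register("CPU", lambda n: f"jacobi on CPU, n={n}")
jacobi_task.register("GPU", lambda n: f"jacobi on GPU, n={n}")
result = jacobi_task.run("GPU", 1024)   # select the GPU implementation
```

The scheduler can then pick any registered PU for a task without the caller changing its code.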
12
Profiler and Database
• The profiler monitors tasks' execution times and characteristics and stores them in a timing performance database.
• Characteristics include input data (size and type) and data transfers between PUs, among others.
13
Profiler and Database
• Performance is measured by counting clocks on the host (CPU), which intrinsically takes into account data transfer times between the CPU and the PU, possible initialization and synchronization times on the PUs, and latency.
14
Dynamic Scheduler
• First, it establishes an initial scheduling guess over the PUs when the application(s) start(s).
o First Assignment Phase – FAP
• Second, for every newly arriving task, it performs scheduling by consulting the timing database.
o Runtime Assignment Phase – RAP
15
First Assignment Phase – FAP
• Given a set of tasks with predefined costs for the PUs stored in the database, the first assignment phase schedules the tasks over the asymmetric PUs.
• Objective: the lowest total execution time
o m: the number of PUs (here, m = 2)
o n: the number of considered tasks
o i: task index
o j: processor index
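The assignment problem above can be sketched by exhaustive search; this is an illustrative toy (the slides' optimal baseline is an exhaustive search, but the objective here, minimizing the makespan of m = 2 PUs, and all names and costs are assumptions):

```python
from itertools import product

def fap(costs, m=2):
    """costs[i][j] = execution time of task i on PU j.
    Try all m**n assignments and keep the one with the lowest
    completion time (the load of the busiest PU)."""
    n = len(costs)
    best, best_assign = float("inf"), None
    for assign in product(range(m), repeat=n):
        loads = [0.0] * m
        for i, j in enumerate(assign):
            loads[j] += costs[i][j]          # task i runs on PU j
        if max(loads) < best:
            best, best_assign = max(loads), list(assign)
    return best_assign, best

# 3 tasks, 2 PUs (0 = GPU, 1 = CPU), hypothetical costs in ms
costs = [[10, 30], [20, 25], [40, 15]]
assign, makespan = fap(costs)
```

Exhaustive search is only feasible for small n; the paper's heuristics exist precisely to avoid this exponential cost at runtime.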
21
Runtime Assignment Phase – RAP
• The arrival of new tasks is modeled as a FIFO (First In, First Out) queue.
• Assignment reconfiguration: tasks that were already scheduled but not yet executed change their assignment if doing so yields a performance gain.
• When there is no database entry for a task with a specific domain size, the lookup function retrieves the data of the task with the most similar domain size.
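The queueing and nearest-size lookup can be sketched as follows; the database contents, task names, and cost values are hypothetical:

```python
from collections import deque

# (task, pu, domain size) -> recorded cost, as the profiler would fill it
db = {("jacobi", "GPU", 128): 4.0, ("jacobi", "CPU", 128): 9.0,
      ("jacobi", "GPU", 512): 30.0, ("jacobi", "CPU", 512): 22.0}

def lookup(task, pu, size):
    # If the exact size has no entry, use the most similar domain size.
    sizes = [s for (t, p, s) in db if t == task and p == pu]
    nearest = min(sizes, key=lambda s: abs(s - size))
    return db[(task, pu, nearest)]

def rap(queue, pus=("GPU", "CPU")):
    schedule = []
    while queue:                                  # FIFO: oldest task first
        task, size = queue.popleft()
        pu = min(pus, key=lambda p: lookup(task, p, size))
        schedule.append((task, size, pu))
    return schedule

queue = deque([("jacobi", 130), ("jacobi", 500)])  # arriving tasks
schedule = rap(queue)
```

Here size 130 falls back to the 128 entries (GPU is cheaper) and 500 to the 512 entries (CPU is cheaper).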
22
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
23
Experiment results
• Domain sizes and execution costs of the tasks on the PUs
24
Experiment results
• Comparison of allocation heuristics
o 0 = GPU, 1 = CPU
25
Experiment results
• Overhead of the dynamic scheduling using ALG. 2 and its gain in comparison to scheduling all tasks to the GPU
26
Experiment results
• Scheduling techniques for 24 tasks
o Overhead: the time to perform the scheduling
o Solve time: the execution time to compute the tasks
o Total time: overhead + solve time
o Error: a technique's total time relative to the optimal solution, excluding the technique's overhead
• e.g., (7660 − 6130) / 6130
o Optimal: exhaustive search
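Working through the slide's example with the error definition above (function name is illustrative):

```python
# Error metric: a technique's total time relative to the optimal
# (exhaustive-search) schedule, with the technique's overhead excluded.
def scheduling_error(total_time, optimal_time):
    return (total_time - optimal_time) / optimal_time

err = scheduling_error(7660, 6130)   # the slide's example, about 25%
```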
27
Experiment results
• Scheduling 24 tasks in the FAP + 42 tasks arriving in the RAP
28
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
29
Related work
• Distributed processing on a CPU-GPU platform
• Scheduling on a CPU-GPU platform
o HEFT (Heterogeneous-Earliest-Finish-Time)
30
Related work

                   StarPU                  this paper
execution model    codelets                OpenCL
method             low-level               high-level
motivation         matrix multiplication   CFD
system             runtime system          scheduling database
31
Outline
• Introduction
• Motivation
• System
• Experiment results
• Related work
• Conclusion
32
Conclusion
• This paper presents a context-aware runtime and tuning system that reduces the execution time of engineering applications while keeping the scheduling overhead low.
• We combined a model for a first scheduling step, based on an off-line performance benchmark, with a runtime model that keeps track of the real execution times of the tasks, with the goal of extending the OpenCL scheduling process.
33
Conclusion
• We achieved an execution-time gain of 21.77% compared to statically assigning all tasks to the GPU, with a scheduling error of only 0.25% compared to exhaustive search.
34
Thanks for listening!