Runtime System and Scheduling Support for High-End CPU-GPU Architectures
Vignesh Ravi, Dept. of Computer Science and Engineering
Advisor: Gagan Agrawal
The Death of Single-core CPU Scaling
The Landscape of Computing – Moore’s Law
[Chart: trends in transistor count, clock speed, power, and efficiency over time]

Until 2004:
• Double the # of transistors (follows Moore's law)
• Simply increase the clock frequency
• Of course, consume more power
• Significantly improved efficiency

The Free Lunch is over!
• Single-core clock frequency reaches a plateau
• End of Moore's law ...
• Alternate processor design required
Since 2005, Now and Future…
• The rise of Multi-core, Many-core architectures …
• Parallel programming …
Rise of Multi-core, Many-core …
Multi-core CPUs:
• Executive-like: more room for control logic
• 2 – 12 cores
• Clock speed: ~1.8 GHz – 3.3 GHz

Many-core GPUs:
• Massive arithmetic, minimal control
• Specialized co-processing
• In the range of 512 cores
• Clock speed: ~1.2 GHz

[Chart: GFLOPS comparison of CPUs and GPUs]
Rise of Heterogeneous Architectures
• Today's High Performance Computing
  – Multi-core CPUs and many-core GPUs are mainstream
• Many-core GPUs offer
  – Excellent "price-performance" & "performance-per-watt"
  – Used in financial modeling, gas and oil exploration, medical applications, ...
• Flavors of heterogeneous computing
  – Multi-core CPUs + GPUs connected over PCI-E
  – Accelerated Processing Units (APUs), e.g., AMD Fusion
  – Intel MIC, Sandy Bridge, Nvidia Denver, ...
• Heterogeneous architectures are pervasive
  – Supercomputers & clusters, clouds, desktops, notebooks, tablets, mobiles, ...
Today’s Computing Platforms are Heterogeneous!
New Challenges are Emerging …
New Challenges
[Diagram: Application(s) running on a CPU + GPU heterogeneous architecture]
Question 1: How to benefit from the CPU and GPU simultaneously?
  → CPU/GPU work distribution module
  → Concurrency control/synchronization between CPU and GPU
Question 2: How to improve utilization of GPUs?
  → Enable sharing of a GPU across different applications
Question 3: Job scheduling for heterogeneous clusters?
  → Revisit job scheduling for CPU-GPU clusters
Question 4: Mechanisms to debug and profile GPU programs?
  → Tools development for GPUs
My Thesis Focus
[Diagram: Application(s) on a CPU + GPU heterogeneous architecture, with the primary focus areas of this thesis highlighted]

• CPU/GPU work distribution module
• Concurrency control/synchronization between CPU/GPU
• Enable sharing of a GPU across different applications
• Revisit job scheduling for CPU-GPU clusters
• Tools development for GPUs
Thesis Contributions
Support for GPU Sharing across Multiple Applications
• Supporting GPU Sharing with a Transparent Runtime Consolidation Framework (HPDC 2011)

Runtime Systems and Dynamic Work Distribution for Heterogeneous Systems
• Compiler and Runtime Support for Enabling Generalized Reductions on Heterogeneous Systems (ICS 2010)
• A Dynamic Scheduling Framework for Emerging Heterogeneous Systems (HiPC 2011)

Job Scheduling for Heterogeneous Clusters
• Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes (CCGrid 2012)
• Value-Based Scheduling Framework for Modern Heterogeneous Clusters (Under Submission)
Today’s Talk
Pre-Candidacy Work:

Support for GPU Sharing across Multiple Applications
• Supporting GPU Sharing with a Transparent Runtime Consolidation Framework (HPDC 2011)

Runtime Systems and Dynamic Work Distribution for Heterogeneous Systems
• Compiler and Runtime Support for Enabling Generalized Reductions on Heterogeneous Systems (ICS 2010)
• A Dynamic Scheduling Framework for Emerging Heterogeneous Systems (HiPC 2011)

Post-Candidacy Work:

Job Scheduling for Heterogeneous Clusters
• Scheduling Concurrent Applications on a Cluster of CPU-GPU Nodes (CCGrid 2012)
• Value-Based Scheduling Framework for Modern Heterogeneous Clusters (Under Submission for SC 2012)
Outline of Presentation
• Recap of Pre-Candidacy work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions
Motivation
• In HPC, the demand for computing is ever increasing
  – CPU+GPU platforms expose huge raw processing power
• Top 6 supercomputers
  – Heterogeneous: utilization is under ~50%
  – Homogeneous: utilization is about 80%
• Application development for multi-core CPUs and GPUs is still independent
  – "No established mechanism" to exploit the aggregate power
• Can computations benefit from simultaneously utilizing the CPU and GPU?
Runtime System and Work Distribution for CPU-GPU Architectures
• Focus on specific classes of computation patterns
  – Generalized Reduction Structure
  – Structured Grid Computations
• Improve application developer productivity
  – Facilitate high-level API support
  – Hide parallelization difficulties through runtime support
• Improve efficiency
  – Dynamic work distribution between CPU & GPU (a small sketch follows below)
• Show significant performance improvements
  – Up to 63% for generalized reduction structures
  – Up to 75% for structured grid computations
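A minimal sketch of chunk-based dynamic work distribution between a CPU and a GPU worker, in the spirit of the runtime described above; the chunk size and the `process_on_cpu` / `process_on_gpu` callbacks are hypothetical placeholders, and the actual runtime also handles data movement and result merging internally.

```python
# Minimal sketch of dynamic CPU/GPU work distribution (chunk-based).
import queue
import threading

def distribute(work_items, chunk_size, process_on_cpu, process_on_gpu):
    chunks = queue.Queue()
    for i in range(0, len(work_items), chunk_size):
        chunks.put(work_items[i:i + chunk_size])

    results = []
    lock = threading.Lock()

    def worker(process_fn):
        # Each device repeatedly grabs the next available chunk, so the faster
        # device automatically ends up processing more of the input.
        while True:
            try:
                chunk = chunks.get_nowait()
            except queue.Empty:
                return
            partial = process_fn(chunk)
            with lock:
                results.append(partial)

    cpu = threading.Thread(target=worker, args=(process_on_cpu,))
    gpu = threading.Thread(target=worker, args=(process_on_gpu,))
    cpu.start(); gpu.start()
    cpu.join(); gpu.join()
    return results
```

The dynamic (pull-based) distribution avoids having to know the relative CPU/GPU speeds ahead of time, which is why it adapts well across the application classes listed above.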
Outline of Presentation
• Recap of Pre-Candidacy work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions
Motivation
• Emergence of the Cloud: the "pay-as-you-go" model
  – Cluster instances and high-speed interconnects for HPC users
  – Amazon, Nimbix, and SoftLayer offer GPU instances
• Sharing is the basis of the cloud; the GPU is no exception
  – Multiple virtual machines may share a physical node
• Modern GPUs are more expensive than multi-core CPUs
  – Fermi cards with 6 GB memory cost about $4,000
  – Need better resource utilization
• Modern GPUs expose a high degree of parallelism
  – Applications may not utilize the full potential

Sharing a GPU is necessary, but how?
GPU Sharing Through Runtime Consolidation Framework
• Software framework to enable GPU sharing
  – Extended gVirtuS, an open-source call-interception tool
  – GPU sharing through kernel consolidation & virtual contexts
• Basic GPU-sharing mechanisms
  – Time-sharing and space-sharing
• Solutions to the GPU kernel consolidation problem
  – Affinity score, to predict the benefit of consolidation
  – Kernel molding policies, to handle high resource contention
  – An overall scheduling algorithm for multiple GPUs
• Show significant global throughput improvements
  – Up to 50% improvement using the advanced sharing policies
Outline of Presentation
• Recap of Pre-Candidacy work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions
Motivation
• The software stack to program CPU-GPU architectures has evolved
  – A combination of (Pthreads/OpenMP/...) + (CUDA/Stream)
  – Now, OpenCL is becoming more popular
• OpenCL, a device-agnostic platform
  – Offers great flexibility with portable solutions
  – Write a kernel once, execute on any device
• Supercomputers and cloud environments are typically "shared"
  – Accelerate a set of applications as opposed to a single application
  – The "job scheduler" is a critical component of the software stack
• Today's schedulers (like TORQUE) for heterogeneous clusters:
  – Do NOT exploit the portability offered by OpenCL
  – Require user-guided mapping of jobs to heterogeneous resources
  – Do not consider desirable & advanced scheduling possibilities

Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Desirable advanced scheduling considerations
Problem Formulations
Problem Goal:
• Accelerate a set of applications on a CPU-GPU cluster
• Each node has two resources: a multi-core CPU and a GPU
• Map applications to resources to:
  – Maximize overall system throughput
  – Minimize application latency

Scheduling Formulations:
1) Single-Node, Single-Resource Allocation & Scheduling
2) Multi-Node, Multi-Resource Allocation & Scheduling
Scheduling Formulations
Single-Node, Single-Resource Allocation & Scheduling
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Benchmark suites like Rodinia (UVA) & Parboil (UIUC) contain single-node applications
  – They have limited mechanisms to exploit CPU+GPU simultaneously
• Exploits the portability offered by the OpenCL programming model

Multi-Node, Multi-Resource Allocation & Scheduling
• In addition, allows CPU+GPU allocation
  – Desirable in the future to allow flexibility in accelerating applications
• In addition, allows multiple-node allocation per job
• MATE-CG [IPDPS'12], a framework for the map-reduce class of applications, allows such implementations
Challenges and Solution Approach
Decision-Making Challenges:
• Allocate/map to CPU-only, GPU-only, or CPU+GPU?
• Wait for the optimal resource (involves queuing delay)?
• Assign to a non-optimal resource (involves a penalty)?
• Always allocating CPU+GPU may hurt global throughput
  – Should also consider CPU-only or GPU-only allocation
• Always allocate the requested # of nodes?
  – May increase wait time; can consider allocating fewer nodes

Solution Approach:
• Take different levels of user input (relative speedups, execution times, ...)
• Design scheduling schemes for each scheduling formulation
Scheduling Schemes for First Formulation
Two input categories & three schemes; the categories are based on the amount of input expected from the user.

Category 1: Relative multi-core (MP) and GPU (GP) performance as input
• Scheme 1: Relative Speedup based with Aggressive Option (RSA)
• Scheme 2: Relative Speedup based with Conservative Option (RSC)

Category 2: Additionally, the sequential CPU execution time (SQ)
• Scheme 3: Adaptive Shortest Job First (ASJF)
Relative-Speedup Aggressive (RSA) or Conservative (RSC)
Takes the multi-core (MP) and GPU (GP) speedups as input; each job is mapped to the queue of its optimal resource.

• Inputs: N jobs, MP[n], GP[n]
• Create a CPU job queue (CJQ) and a GPU job queue (GJQ); enqueue each job into its optimal resource queue based on (GP - MP)
• Sort CJQ and GJQ in descending order
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty: assign the job at the top of GJQ to R (symmetrically for a CPU and CJQ)
• If R is a GPU and GJQ is empty:
  – Aggressive (RSA): assign the job at the bottom of CJQ to R, minimizing the non-optimal penalty
  – Conservative (RSC): wait for a CPU to become free

(A code sketch of the scheme follows below.)
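A minimal sketch of the RSA/RSC decision logic, assuming MP[j] and GP[j] are job j's multi-core and GPU speedups over sequential CPU execution; the exact sort key and the symmetric CPU branch are assumptions based on the flowchart above.

```python
# Minimal sketch of the RSA (aggressive) / RSC (conservative) schemes.
def build_queues(jobs, MP, GP):
    # Each job goes to the queue of its optimal resource.
    cjq = [j for j in jobs if MP[j] >= GP[j]]
    gjq = [j for j in jobs if GP[j] > MP[j]]
    # Sort by the speedup gap (GP - MP) in descending order.
    cjq.sort(key=lambda j: GP[j] - MP[j], reverse=True)
    gjq.sort(key=lambda j: GP[j] - MP[j], reverse=True)
    return cjq, gjq

def next_job(resource_is_gpu, cjq, gjq, aggressive):
    """Return the job to run on the free resource, or None to wait."""
    if resource_is_gpu:
        if gjq:
            return gjq.pop(0)    # an optimal GPU job is available
        if aggressive and cjq:
            return cjq.pop()     # RSA: take the CPU job with the smallest penalty on the GPU
        return None              # RSC: keep the GPU idle and wait for a CPU
    else:
        if cjq:
            return cjq.pop(0)
        if aggressive and gjq:
            return gjq.pop()
        return None
```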
Adaptive Shortest Job First (ASJF)
• Inputs: N jobs, MP[n], GP[n], SQ[n] (sequential CPU execution times)
• Create CJQ and GJQ; enqueue each job into its optimal resource queue based on (GP - MP)
• Sort CJQ and GJQ in ascending order of execution time (minimizes latency for short jobs)
• R = GetNextResourceAvailable()
• If R is a GPU and GJQ is not empty: assign the job at the top of GJQ to R
• If R is a GPU and GJQ is empty:
  – T1 = GetMinWaitTimeForNextCPU()
  – T2k = penalty of the CPU-queue job 'k' with the minimum penalty on the GPU (GetJobWithMinPenOnGPU(CJQ))
  – If T1 > T2k, assign CJQk to R; otherwise wait for a CPU to become free or for GPU jobs
• This gives an automatic switch between the aggressive and conservative options

(A code sketch of ASJF follows below.)
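A minimal sketch of the ASJF decision for a free GPU, assuming SQ[j] is job j's sequential CPU time, so its runtime is roughly SQ[j]/MP[j] on the CPU and SQ[j]/GP[j] on the GPU; the penalty expression is an assumption consistent with the flowchart.

```python
# Minimal sketch of ASJF for a free GPU.
def asjf_pick_for_gpu(gjq, cjq, SQ, MP, GP, min_cpu_wait):
    """Return a job for the free GPU, or None to keep it idle and wait."""
    if gjq:
        return gjq.pop(0)                      # shortest GPU-optimal job first
    if not cjq:
        return None

    # Penalty a CPU-optimal job pays if it is moved to the GPU instead.
    def gpu_penalty(j):
        return SQ[j] / GP[j] - SQ[j] / MP[j]

    k = min(cjq, key=gpu_penalty)
    # Move the least-penalized CPU job only if waiting for a CPU costs more.
    if min_cpu_wait > gpu_penalty(k):
        cjq.remove(k)
        return k
    return None
```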
Scheduling Scheme for Second Formulation
Solution Approach:
• Flexibly schedule on CPU-only, GPU-only, or CPU+GPU
• Mold the # of nodes requested by a job
  – Consider allocating 1/2 or 1/4 of the requested nodes

Inputs from User:
• Execution times of the CPU-only, GPU-only, and CPU+GPU versions
• Execution times of jobs with n, n/2, and n/4 nodes
• Such application information can also be obtained from profiles
Flexible Moldable Scheduling Scheme (FMS)
• Inputs: N jobs with their execution times for each resource option and node count
• Group jobs with the # of requested nodes as the index
  – Minimizes resource fragmentation; helps co-locate a CPU job and a GPU job on the same nodes
• Sort each group based on the execution time of the CPU+GPU version
  – Gives a global view for co-locating jobs on the same nodes
• Pick a pair of jobs to schedule, in sorted order
• For each job, find the fastest completion option among T(i,n,C), T(i,n,G), and T(i,n,CG)
  – If the pair chooses C for one job and G for the other: co-locate the two jobs on the same set of nodes
  – If both jobs choose the same resource, (C,C), (G,G), or (CG,CG):
    · If 2N nodes are available, schedule the pair in parallel on 2N nodes
    · Otherwise, consider molding by resource type (if the choice was CG), then consider molding the # of nodes for the next job

(A simplified code sketch of the pairing step follows below.)
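A simplified sketch of the FMS pairing step under the assumptions that exec_time[(job, res)] holds T(i,n,C), T(i,n,G), and T(i,n,CG) for a job's requested node count n; the actual FMS policy in the talk also considers n/2 and n/4 molding and the global job ordering.

```python
# Simplified sketch of the FMS pairing step.
def fms_schedule_pair(job_a, job_b, exec_time, free_nodes, requested_nodes):
    """Return a list of (job, resource, nodes) placement decisions."""
    best_a = min(("C", "G", "CG"), key=lambda r: exec_time[(job_a, r)])
    best_b = min(("C", "G", "CG"), key=lambda r: exec_time[(job_b, r)])

    if {best_a, best_b} == {"C", "G"}:
        # Complementary resources: co-locate both jobs on the same nodes.
        return [(job_a, best_a, requested_nodes),
                (job_b, best_b, requested_nodes)]

    if free_nodes >= 2 * requested_nodes:
        # Enough nodes to run both jobs side by side on their best option.
        return [(job_a, best_a, requested_nodes),
                (job_b, best_b, requested_nodes)]

    # Not enough nodes: mold by resource type first (split a CPU+GPU request
    # into CPU-only and GPU-only so the pair still fits on one set of nodes) ...
    if best_a == "CG" and best_b == "CG":
        return [(job_a, "C", requested_nodes),
                (job_b, "G", requested_nodes)]
    # ... otherwise mold the node count of the second job (e.g., n/2).
    return [(job_a, best_a, requested_nodes),
            (job_b, best_b, max(1, requested_nodes // 2))]
```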
Cluster Hardware Setup
• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz)
• Each GPU is an Nvidia Tesla C2050 (1.15 GHz)
• CPU main memory: 48 GB
• GPU device memory: 3 GB
• Machines are connected through InfiniBand
Benchmarks
Single-Node Jobs
• We use 10 benchmarks
  – Scientific, financial, data mining, and image processing applications
• Each benchmark runs with 3 different execution configurations
• Overall, a pool of 30 jobs

Multi-Node Jobs
• We use 3 applications
  – Gridding kernel, Expectation-Maximization, PageRank
• Applications run with 2 different datasets and on 3 different node counts
• Overall, a pool of 18 jobs
Baselines & Metrics
Baseline for Single-Node Jobs
• Blind Round Robin (BRR)
• Manual Optimal (exhaustive search, upper bound)

Baseline for Multi-Node Jobs
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]

Metrics
• Completion Time (Comp. Time)
• Application Latency:
  – Non-optimal Assignment (Ave. NOA Lat.)
  – Queuing Delay (Ave. QD Lat.)
• Maximum Idle Time (Max. Idle Time)
Single-Node Job Results
[Charts: Completion Time, Ave. NOA Lat., Ave. QD Lat., and Max. Idle Time, normalized over the best case, for BRR, RSA, RSC, ASJF, and Manual Optimal, under a uniform CPU-GPU job mix and a CPU-biased job mix]

• 24 jobs on 2 nodes; the proposed schemes are compared on 4 different metrics
• Completion time: 108% better than BRR and within 12% of Manual Optimal
• There is a tradeoff between the non-optimal penalty and the wait time for a resource
• Latency: BRR has the highest latency; RSA pays the non-optimal penalty; RSC suffers high queuing delay; ASJF is as good as Manual Optimal
• Idle time: BRR has very high idle times; RSC can be very high too; RSA has the best utilization among the proposed schemes
Multi-Node Job Results
[Charts: Normalized completion time for TORQUE, MCT, Molding ResType Only, Molding NumNodes Only, and Molding ResType+NumNodes (FMS), (1) varying job execution lengths (short-job/long-job mixes: 75 SJ/25 LJ, 50 SJ/50 LJ, 25 SJ/75 LJ) and (2) varying resource request sizes (small-request/large-request mixes: 75 SR/25 LR, 50 SR/50 LR, 25 SR/75 LR)]

• 32 jobs on 16 nodes
• Varying execution lengths: FMS is 42% better than the best of TORQUE or MCT
• Each type of molding gives a reasonable improvement on its own
• Our schemes utilize the resources better, yielding higher throughput
• The scheduler decides intelligently whether to wait for a resource or mold the job onto a smaller one
• Varying request sizes: FMS is 32% better than the best of TORQUE or MCT
• The benefit from ResType molding is larger than from NumNodes molding
Summary
• Revisited scheduling problems on CPU-GPU clusters
  – Goal: improve aggregate throughput
  – Single-node, single-resource scheduling problem
  – Multi-node, multi-resource scheduling problem
• Developed novel scheduling schemes
  – Exploit the portability offered by OpenCL
  – Automatic mapping of jobs to heterogeneous resources
  – RSA, RSC, and ASJF for single-node jobs
  – Flexible Moldable Scheduling (FMS) for multi-node jobs
• Significant improvement over the state-of-the-art
Outline of Presentation
• Recap of Pre-Candidacy work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions
Motivation
• Previously, the goal was to improve overall global throughput & latency
• Other desirable goals exist for supercomputer and cloud environments
  – Market-based scheduling goals (providers' profit and user satisfaction)
  – For example, MOAB (with SLAs) for supercomputers and large clusters
  – For example, Amazon classifies users as Free, Spot, On-Demand, and Reserved
  – Each user has a different level of importance and satisfaction
• Supercomputers and clouds engage massively parallel resources
  – Multi-core CPUs with 16 cores, GPUs with 512 cores
  – Recent announcements of MIC (about 50-60 cores) in Stampede
  – Efficient resource utilization is important
• Today's schedulers (like TORQUE) for heterogeneous clusters:
  – Have no notion of market-based scheduling
  – Require user-guided mapping of jobs to heterogeneous resources
  – Lack the ability/schemes to share massively parallel resources

Revisit scheduling problems for CPU-GPU clusters:
1) Exploit the portability offered by models like OpenCL
2) Automatic mapping of jobs to resources
3) Market-based scheduling considerations
4) Schemes to enable automatic sharing of resources
Value Function
• Each job is attached with a value function
• Linear-decay value function [Irwin et al., HPDC'04]
  – Maximum value → importance/priority
  – Decay rate → urgency
• Value functions with different shapes
  – Can represent different SLAs, e.g., a step function
• Yield is obtained after job completion, defined as:

    Yield = maxValue - decay * delay

• Delay can be the sum of any of four components
  – Queuing delay, non-optimal penalty, sharing 1-core penalty, sharing CPU/GPU penalty
• Yield represents both "provider's profit" and "user satisfaction"
• We believe the value function provides a rich, yet simple, formulation for market-based scheduling

(A small worked sketch of the yield computation follows below.)
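A minimal sketch of the linear-decay yield computation from the slide; the parameter names and the example numbers are illustrative, not values from the talk.

```python
# Minimal sketch of the linear-decay yield computation.
def yield_of(max_value, decay_rate, queuing_delay=0.0, nonoptimal_penalty=0.0,
             sharing_core_penalty=0.0, sharing_cpu_gpu_penalty=0.0):
    # Delay is the sum of the four penalty components listed on the slide.
    delay = (queuing_delay + nonoptimal_penalty +
             sharing_core_penalty + sharing_cpu_gpu_penalty)
    return max_value - decay_rate * delay

# Example: a job worth 100 units with decay rate 2/sec that waited 10 s in the
# queue and paid a 5 s non-optimal penalty yields 100 - 2 * 15 = 70.
print(yield_of(100.0, 2.0, queuing_delay=10.0, nonoptimal_penalty=5.0))
```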
Scheduling Problem Formulation
• Given a heterogeneous cluster, with each node containing:
  – 1 multi-core CPU and 1 GPU
• Schedule a set of jobs on the cluster
  – To maximize the aggregate yield
• Allocates a multi-core CPU or a GPU from a node in the cluster
  – Does not allocate both the multi-core CPU and the GPU to a job
  – Does not allocate multiple nodes to a job
  – Both are considerations for future work
• Exploits the portability offered by the OpenCL programming model
  – Flexibly maps a job onto either the CPU or the GPU
• Allows sharing of a multi-core CPU or a GPU
  – Up to two jobs per resource
  – Limited to space-sharing
Overall Scheduling Approach
[Diagram: jobs arrive in batches, are pushed into the CPU or GPU queue, and execute on the corresponding resource]

• Initial mapping and ordering (when both job queues are non-empty): push each job into its optimal resource queue and sort the queues to improve yield
• Dynamic re-mapping: when a resource (say, the CPU) is free but its job queue is empty, the resource would sit idle; we propose various schemes for dynamically re-mapping jobs from the other queue
Heuristics for Different Stages
• Initial mapping & ordering of queues
  – Initial assignment of jobs to queues: based on the optimal walltime
  – Sorting of jobs within a queue: adapt the Reward heuristic [earlier work: HPDC'04] to our formulation
• Dynamic re-mapping of jobs to a non-optimal resource
  – Uncoordinated schemes (three new heuristics)
    · Last Optimal Reward (LOR)
    · First Non-Optimal Reward (FNOR)
    · Last Non-Optimal Reward Penalty (LNORP)
  – Coordinated scheme (one new heuristic)
    · Coordinated Least Penalty (CORLP)
• Sharing jobs on a single type of resource (one new heuristic)
  – Scalability-Decay Factor, top K fraction [K is tunable]
Sorting Jobs in the Queues
• The Reward heuristic is based on two market-based terms
  – Present (discounted gain) Value
  – Opportunity Cost
• Present Value (PV)
  – Value gained after time 't', after discounting the risk of running the job
  – Receiving $1,000 now is worth more than $1,000 five years from now
  – The shorter the job, the lower the risk
• Opportunity Cost (Cost)
  – Degradation cost of the alternatives to pursuing a certain action
  – Prefer high-decay jobs over low-decay jobs
  – In our case, the cost of choosing a job 'i' over a job 'j'
• Reward
  – Choose the job with the highest reward to schedule on the corresponding resource

    PV_i = yield_i / (1 + dis_rate * OptimalWT_i)
    Cost_i = (Σ_{j=0..n} decay_j) - decay_i
    Reward_i = (PV_i - Cost_i) / OptimalWT_i

(A code sketch of the reward computation follows below.)
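A minimal sketch of the reward-based queue ordering, assuming each job object exposes its expected yield, decay rate, and optimal walltime (these field names are placeholders); dis_rate is the discount rate from the Present Value formula.

```python
# Minimal sketch of reward-based sorting of a job queue.
def reward(job, queue, dis_rate):
    pv = job.yield_estimate / (1.0 + dis_rate * job.optimal_walltime)
    # Opportunity cost: total decay of the other waiting jobs that keep
    # accruing while this job occupies the resource.
    cost = sum(other.decay for other in queue) - job.decay
    return (pv - cost) / job.optimal_walltime

def sort_queue(queue, dis_rate):
    # The highest-reward job is scheduled first on the corresponding resource.
    return sorted(queue, key=lambda j: reward(j, queue, dis_rate), reverse=True)
```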
Dynamic Remapping – Uncoordinated Schemes
• Applies only when a resource is idle and its job queue is empty
  – Idle resources reduce utilization, and hence overall yield (considering the waiting jobs in the other queue)
  – Dynamically assign a job to the non-optimal resource, taken from that job's optimal queue
• Three schemes, based on two key aspects
  – Which job will have the best reward on the non-optimal resource?
  – Which job will suffer the least reward penalty?

1. Last Optimal Reward (LOR)
  – Exploits the "reward score" already computed on each queue for each job
  – Simply chooses the job with the least reward from the optimal resource queue
  – That job had the least reward on its optimal resource anyway, so moving it carries the least risk
  – O(N) to seek the last job in the queue
Dynamic Remapping – Uncoordinated Schemes
2. First Non-Optimal Reward (FNOR)
  – Computes the reward each job could produce on the non-optimal resource
  – Explicitly considers the non-optimal penalty
  – Moves the job with the highest reward on the non-optimal resource
  – O(N log N) to sort by the newly computed reward

3. Last Non-Optimal Reward Penalty (LNORP)
  – FNOR fails to consider reward degradation
  – LNORP computes the reward degradation on the non-optimal resource
  – Moves the job with the least reward degradation

    Suff_factor_i = Non-OptimalWT_i / OptimalWT_i
    Non-OptimalReward_i = OptimalReward_i / Suff_factor_i
    Non-OptimalRewardPenalty_i = OptimalReward_i - Non-OptimalReward_i

(A code sketch comparing the three uncoordinated schemes follows below.)
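A minimal sketch of the three uncoordinated re-mapping heuristics, assuming each queued job carries its optimal reward and its walltimes on both resources (the field names are placeholders).

```python
# Minimal sketch of the LOR / FNOR / LNORP selection rules.
def suffering_factor(job):
    return job.nonoptimal_walltime / job.optimal_walltime

def nonoptimal_reward(job):
    return job.optimal_reward / suffering_factor(job)

def pick_job_to_remap(optimal_queue, scheme):
    """Choose which waiting job to move to the idle non-optimal resource."""
    if scheme == "LOR":
        # Least reward on its optimal resource -> least risk in moving it.
        return min(optimal_queue, key=lambda j: j.optimal_reward)
    if scheme == "FNOR":
        # Highest reward when executed on the non-optimal resource.
        return max(optimal_queue, key=nonoptimal_reward)
    if scheme == "LNORP":
        # Smallest reward degradation (penalty) caused by the move.
        return min(optimal_queue,
                   key=lambda j: j.optimal_reward - nonoptimal_reward(j))
    raise ValueError(scheme)
```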
Dynamic Remapping – Coordinated Scheme
• Applies even when the resource is not idle and its job queue is non-empty
  – It may be necessary to move a job from one queue to another due to imbalance
  – Provides a better global view of both queues
• Factors affecting the imbalance:
  – Decay rates of jobs across the queues
  – Execution lengths (or queuing delays) of jobs across the queues
• For coordination across queues:
  – Determine when coordination is required
  – If coordination is required, a heuristic decides which job to move
• Detecting when coordination is required
  – Total Queuing-Delay Decay-Rate Product (TQDP) for each queue 'i':

    TQDP_i = Σ_{j=0..n} Queuing_delay_j * decay_j

• Heuristic for picking a job to move
  – Move the job with the least non-optimal penalty
• This scheme is called Coordinated Least Penalty (CORLP); a code sketch follows below.
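A minimal sketch of the CORLP trigger and job selection; the imbalance threshold and the job field names are assumptions for illustration, not values from the talk.

```python
# Minimal sketch of CORLP: detect queue imbalance via TQDP, then move the
# job with the least non-optimal penalty from the heavier queue.
def tqdp(queue):
    # Total Queuing-Delay Decay-Rate Product for one queue.
    return sum(job.queuing_delay * job.decay for job in queue)

def corlp_step(cpu_queue, gpu_queue, imbalance_threshold=2.0):
    """If the queues are imbalanced, move the least-penalized job across."""
    heavy, light = ((cpu_queue, gpu_queue)
                    if tqdp(cpu_queue) >= tqdp(gpu_queue)
                    else (gpu_queue, cpu_queue))
    if not heavy or tqdp(heavy) < imbalance_threshold * max(tqdp(light), 1e-9):
        return None  # queues are balanced enough; no coordination needed
    # Move the job that suffers the least non-optimal penalty when switched.
    job = min(heavy, key=lambda j: j.nonoptimal_walltime - j.optimal_walltime)
    heavy.remove(job)
    light.append(job)
    return job
```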
Heuristic for Sharing
• Allow up to two jobs to space-share a resource
  – E.g., on an 8-core multi-core CPU, 2 jobs each use 4 cores
  – Penalties from time-sharing can be high due to greater resource contention
• Factors affecting sharing
  – Jobs will use half the resources and will incur a slowdown
  – On the other hand, more resources become available overall
• Jobs/applications
  – Can be categorized as low-, medium-, or high-scaling (based on models/profiling)
  – Some jobs are less urgent than others
• "When" to enable sharing?
  – When a large fraction of jobs in the pending queues have negative yield
• "Who" are the candidates to share? (Scalability-Decay Rate factor)
  – Jobs are grouped in order of low to high scalability
  – Within each group, jobs are ordered by decay rate
  – Pick the top K fraction of jobs; 'K' is tunable (low-scalability, low-decay jobs first)

(A code sketch of the candidate selection follows below.)
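A minimal sketch of the Scalability-Decay candidate selection for space-sharing; the scalability classes and job field names are placeholders.

```python
# Minimal sketch of picking the top K fraction of jobs to space-share.
def sharing_candidates(pending_jobs, k_fraction):
    """Return the top K fraction of jobs to space-share a resource."""
    scalability_rank = {"low": 0, "medium": 1, "high": 2}
    # Group by scalability (low first), then order by decay rate within a group,
    # so low-scalability, low-urgency jobs are shared first.
    ordered = sorted(pending_jobs,
                     key=lambda j: (scalability_rank[j.scalability], j.decay))
    count = int(len(ordered) * k_fraction)
    return ordered[:count]
```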
High-Level Scheduler Framework Design

[Diagram: the master node runs a cluster-level scheduler containing the scheduling schemes & policies, a TCP communicator, and the submission, pending, execution, and finished queues. Each compute node registers with the master and runs a node-level scheduler with its own TCP communicator, CPU-job and GPU-job execution threads, and the GPU consolidation framework, managing one multi-core CPU and one GPU.]
GPU Sharing Framework

[Diagram: front end / back end. On the front end, CUDA applications (App1, App2) link against an interception library and talk to the back end over the front end – back end communication channel. On the back end, a server dispatcher queues incoming workloads to virtual contexts; each virtual context has a ready queue and a workload consolidator that issues work through the CUDA runtime and driver to the physical GPUs (GPU1 ... GPUn).]
Cluster Hardware Setup
• Cluster of 16 CPU-GPU nodes
• Each CPU is an 8-core Intel Xeon E5520 (2.27 GHz), main memory 48 GB
• Each GPU is an Nvidia Tesla C2050 (1.15 GHz), device memory 3 GB

Benchmarks
• We use 10 benchmarks
  – Scientific, financial, data mining, and image processing applications
• Each benchmark runs with 3 different execution configurations
• Overall, a pool of 30 jobs

Baselines
• TORQUE, a widely used resource manager for heterogeneous clusters
• Minimum Completion Time (MCT) [Maheswaran et al., HCW'99]

Metrics
• Completion Time
• Application Latency
• Average Yield
Comparison with Torque-based Metrics
[Chart: Completion time and average latency for uniform (UM) and biased (BM) job mixes, normalized over the best case, for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP]

• The baselines and our schemes target two different sets of metrics; here we examine how our schemes perform on the TORQUE-style metrics
• In all cases, we run 256 jobs on a 16-node cluster
• Completion time: 10% better (uniform mix) and 22% better (biased mix)
  – Efficient use of resources (no idle time)
  – The avoided idle time outweighs the non-optimal penalty
  – The baselines are worse with the biased mix (BM)
• Average latency: 20% better
  – Our schemes may prefer short jobs, reducing latency
  – They also minimize the non-optimal penalty and reduce queuing delay
Results with Average Yield Metric
[Charts: Relative average yield for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP, (1) varying the CPU/GPU job mix ratio (25C/75G, 50C/50G, 75C/25G) and (2) comparing linear-decay and step-decay value functions with 25% CPU jobs and 75% GPU jobs]

Varying CPU-GPU job mix:
• Up to 8.8x better
• Biased cases show very high improvement: more room for idle time and dynamic mapping
• 2.3x better even for the uniform mix
• TORQUE has no notion of value; our schemes order jobs for yield and eliminate idle time on the resources

Impact of value decay functions:
• Up to 3.8x better with linear decay and up to 6.9x better with step decay
• Shows the adaptability of the proposed schemes to different shapes of value functions
• Step decay is more coarse-grained, hence the improvement is larger
Results with Average Yield Metric
[Charts: Relative average yield (1) as the total number of jobs varies (128, 256, 384, 512) for TORQUE, MCT, LOR, FNOR, LNORP, and CORLP, and (2) for LOR, FNOR, LNORP, and CORLP as the CPU/GPU queue parameters (execution length and decay) become increasingly imbalanced]

Impact of varying load:
• As the load increases, the yield from the baselines decreases roughly linearly
• The proposed schemes initially achieve increased yield and then sustain it, as they try to maximize the yield
• Up to 8.2x better

Coordinated vs. uncoordinated schemes:
• Why do we need coordination? Imbalance in decay rates or queuing delays across the queues
• As the imbalance increases, the improvement from CORLP increases
• Up to 78% better
Yield Improvements from Sharing
[Chart: Yield improvement (%) vs. the sharing K factor (0.1 – 0.6), for CPU-only sharing, GPU-only sharing, and combined CPU & GPU sharing]

Effect of the sharing K fraction (the fraction of jobs allowed to share):
• The benefit of freeing a resource is always offset by the slowdown incurred by the sharing jobs
• The benefit increases up to a point, then decreases (K = 0.5 in this case)
• Emphasizes the need for careful selection of the K fraction
• Up to 23% improvement due to sharing
Overhead of Sharing a CPU Core
[Chart: Overhead (%) of sharing a CPU core, for six job mixes and their geometric mean]

• A CPU core is shared between a CPU job and a GPU job scheduled on the same node
• The overhead is within 10%
• The variation depends on the amount and frequency of data transfer/communication between the CPU and the GPU
Summary
• Value-based scheduling on CPU-GPU clusters
  – Goal: improve aggregate yield
• Developed novel scheduling schemes for dynamic mapping
  – Three uncoordinated schemes
  – One coordinated scheme
• Enabled automatic sharing of resources, including the GPU
  – One novel heuristic for sharing
• Framework for evaluating the proposed schemes
• Significant improvement over the state-of-the-art
  – Based on completion time & latency
  – Based on average yield
Outline of Presentation
• Recap of Pre-Candidacy work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions
Future Work
• Industry is moving towards integrated CPU-GPU architectures
  – Intel recently announced Sandy Bridge for servers
  – AMD opened up its HSA roadmap for APUs
• In the HPC segment, discrete CPU-GPU systems will continue
• Machines with an integrated GPU as well as a discrete GPU
  – For instance, the announcement of the Stampede supercomputer
  – Important to understand the benefits of one architecture over the other
Future Work (contd.)
• OpenCL, an open standard for heterogeneous computing
  – Gaining momentum owing to its maturity (Spafford et al., ORNL's Scalable Heterogeneous Computing Benchmark Suite (SHOC))
  – "Write a kernel once, execute on many devices" is very attractive
  – Work distribution and communication across devices are still explicit
• Build library and runtime support for OpenCL
  – Overarching goal: enable deployment of application(s) on a large cluster of heterogeneous nodes
  – A task/work-size driven approach for work distribution and scheduling
  – Tasks transparently "map to" and "scale" on multi-cores, integrated GPUs, and discrete GPUs
Outline of Presentation
• Recap of Pre-Candidacy work
  – Runtime System and Work Distribution
  – GPU Sharing Through Runtime Consolidation Framework
• Post-Candidacy work
  – Concurrent Job Scheduling to Improve Global Throughput
  – Value-based Job Scheduling
• Future Work
• Thesis Conclusions
Thesis Conclusions
• Heterogeneity is the order of today's computing
• New challenges at the node, cluster, and cloud levels
  – Increased architectural complexity for developers
  – Lack of desired software features and mechanisms
• Runtime library support to enable various computation patterns
  – Less application developer burden, improved performance
• Runtime consolidation framework to enable GPU sharing
  – Improved global throughput in heavily shared environments
• Revisited job scheduling problems
  – Novel schemes to improve global throughput
  – Novel schemes to improve a market-based metric
Thank You! Questions?
Benchmarks – Large Dataset
Benchmark           | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver          | 7.3                  | 4.7              | 6.8                    | 14336*14336
Image Processing    | 33.8                 | 5.1              | 7.8                    | 14336*14336
FDTD                | 8.4                  | 2.2              | 7.6                    | 14336*14336
BlackScholes        | 2.6                  | 2.1              | 7.2                    | 10 mil options
Binomial Options    | 11.8                 | 5.6              | 4.2                    | 1024 options
MonteCarlo          | 45.4                 | 38.4             | 7.9                    | 1024 options
Kmeans              | 330.0                | 12.1             | 7.8                    | 1.6 * 10^9 points
KNN                 | 67.3                 | 7.8              | 6.2                    | 67108864 points
PCA                 | 142.0                | 9.7              | 5.6                    | 262144*80
Molecular Dynamics  | 46.6                 | 12.9             | 7.9                    | 256000 nodes, 31744000 edges
Benchmarks – Small Dataset
Benchmark           | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver          | 1.8                  | 3.8              | 7.1                    | 7168*7168
Image Processing    | 8.4                  | 5.6              | 7.5                    | 7168*7168
FDTD                | 2.1                  | 1.3              | 7.7                    | 7168*7168
BlackScholes        | 0.7                  | 0.6              | 6.8                    | 2.5 mil options
Binomial Options    | 3.0                  | 2.3              | 4.2                    | 128 options
MonteCarlo          | 11.0                 | 9.4              | 7.9                    | 256 options
Kmeans              | 74.2                 | 6.3              | 7.7                    | 0.4 * 10^9 points
KNN                 | 16.8                 | 2.9              | 6.2                    | 16777216 points
PCA                 | 33.8                 | 9.1              | 5.6                    | 65536*80
Molecular Dynamics  | 6.7                  | 12.8             | 7.3                    | 32000 nodes, 3968000 edges
Benchmarks – Large No. of Iterations
Benchmark           | Seq. CPU Exec. (sec) | GPU Speedup (GP) | Multicore Speedup (MP) | Dataset Characteristics
PDE Solver          | 722.1                | 4.3              | 8.1                    | 14336*14336
Image Processing    | 3385.5               | 4.8              | 8.0                    | 14336*14336
FDTD                | 423.3                | 1.8              | 7.9                    | 14336*14336
BlackScholes        | 269.1                | 92.8             | 7.8                    | 10 mil options
Binomial Options    | 1213.6               | 12.2             | 4.3                    | 1024 options
MonteCarlo          | 453.3                | 368.5            | 7.8                    | 1024 options
Kmeans              | 1593.8               | 12.6             | 7.9                    | 1.6 * 10^9 points
KNN                 | 1691.1               | 58.4             | 6.9                    | 67108864 points
PCA                 | 2835.7               | 11.8             | 6.2                    | 262144*80
Molecular Dynamics  | 593.8                | 20.8             | 7.8                    | 256000 nodes, 31744000 edges
Frequency (%)
No. of Jobs | LOR  | FNOR | LNORP | CORLP
64          | 15.6 | 12.5 | 15.6  | 18.8
128         | 10.9 | 11.7 | 11.7  | 14.8
256         | 9.4  | 12.1 | 10.2  | 15.6
512         | 9.6  | 9.6  | 9.0   | 13.1

Improvement in User Satisfaction (%)
Decay Ratio   | Job Type   | MCT  | LOR   | FNOR  | LNORP | CORLP
25% H & 75% L | High Decay | 5.3  | 78.6  | 84.6  | 83.8  | 104.2
25% H & 75% L | Low Decay  | 6.9  | 35.7  | 45.5  | 47.8  | 54.1
50% H & 50% L | High Decay | 11.7 | 114.3 | 118.8 | 124.7 | 144.4
50% H & 50% L | Low Decay  | 11.8 | 58.8  | 58.9  | 66.1  | 72.2
75% H & 25% L | High Decay | 14.9 | 69.7  | 73.1  | 86.8  | 107.4
75% H & 25% L | Low Decay  | 16.2 | 20.5  | 22.4  | 31.3  | 33.8

Yield Improvement with Increasing Load
No. of Jobs | Yield Improvement (%)
128         | 18.2
256         | 20.1
384         | 22.3
512         | 22.9