Scheduling Generic Parallel Applications – Meta-scheduling
Sathish Vadhiyar
Sources/Credits: taken from the papers listed in the "References" slides
Scheduling Architectures
• Centralized schedulers
    • Single-site scheduling: a job does not span across sites
    • Multi-site: the opposite
• Hierarchical structures: a central scheduler (metascheduler) for global scheduling, and local scheduling on individual sites
• Decentralized scheduling: distributed schedulers interact, exchange information and submit jobs to remote systems
    • Direct communication: the local scheduler directly contacts remote schedulers and transfers some of its jobs
    • Communication via central job pool: jobs that cannot be immediately executed are pushed to a central pool; other local schedulers pull the jobs out of the pool
Various Scheduling Architectures
[Two figure slides; diagrams not reproduced in the transcript.]
Metascheduler across MPPs
Types
• Centralized
    • A meta scheduler and local dispatchers
    • Jobs submitted to the meta scheduler
• Hierarchical
    • Combination of central and local schedulers
    • Jobs submitted to the meta scheduler
    • Meta scheduler sends each job to the site for which the earliest start time is expected
    • Local schedulers can follow their own policies
• Distributed
    • Each site has a metascheduler and a local scheduler
    • Jobs submitted to the local metascheduler
    • Jobs can be transferred to the sites with the lowest load
Evaluation of schemes
Centralized
1. Global knowledge of all resources, hence optimized schedules
2. Can act as a bottleneck for a large number of resources and jobs
3. May take time to transfer jobs from the meta scheduler to local schedulers; needs strategic position of the meta scheduler
Hierarchical
1. Medium level overhead
2. Sub-optimal schedules
3. Still needs strategic position of the central scheduler
Distributed
1. No bottleneck: workload evenly distributed
2. Needs all-to-all connections between MPPs
Evaluation of Various Scheduling Architectures
• Experiments to evaluate slowdowns in the 3 schemes
• Based on an actual trace from a supercomputer centre: 5000 job set
• 4 sites were simulated: 2 with the same load as the trace, the other 2 with run time multiplied by 1.7
• FCFS with EASY backfilling was used
• slowdown = (wait_time + run_time) / run_time
• 2 more schemes
    • Independent: local schedulers act independently, i.e. sites are not connected
    • United: resources of all processors are combined to form a single site
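The slowdown metric above can be sketched in a few lines of Python. The job values below are hypothetical (they are not taken from the trace used in the experiments) and the helper names are illustrative only.

```python
def slowdown(wait_time: float, run_time: float) -> float:
    """Slowdown as defined on the slide: (wait_time + run_time) / run_time."""
    if run_time <= 0:
        raise ValueError("run_time must be positive")
    return (wait_time + run_time) / run_time


def average_slowdown(jobs):
    """Mean slowdown over a list of (wait_time, run_time) pairs."""
    values = [slowdown(w, r) for w, r in jobs]
    return sum(values) / len(values)


# Hypothetical jobs: (wait_time, run_time) in seconds.
jobs = [(0.0, 100.0), (300.0, 100.0), (50.0, 50.0)]
print([slowdown(w, r) for w, r in jobs])  # [1.0, 4.0, 2.0]
print(average_slowdown(jobs))
```

A job that never waits has slowdown 1; short jobs are penalized heavily by even modest waits, which is why short narrow jobs dominate the comparison between schemes.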
Results
[Figure not reproduced in the transcript.]
Observations
1. Centralized and hierarchical performed slightly better than united
    a. Compared to hierarchical, scheduling decisions have to be made for all jobs and all resources in united; overhead and hence wait time is high
    b. Comparing united and centralized:
        i. 4 categories of jobs corresponding to 4 different combinations of 2 parameters: execution time (short, long) and number of resources requested (narrow, wide)
        ii. Usually a larger number of long narrow jobs than short wide jobs
        iii. Why are centralized and hierarchical better than united?
2. Distributed performed poorly
    a. Short narrow (SN) jobs incurred more slowdown
    b. Short narrow jobs are large in number and are the best candidates for backfilling
    c. Backfilling dynamics are complex
    d. A site with a light average load may not always be the best choice. SN jobs may find the earliest holes in a heavily loaded site.
Newly Proposed Models
• K-distributed model
    • Distributed scheme where the local metascheduler distributes jobs to the k least loaded sites
    • When a job starts on a site, a notification is sent to the local metascheduler, which in turn asks the k-1 other schedulers to dequeue the job
• K-Dual queue model
    • 2 queues are maintained at each site: one for local jobs and the other for remote jobs
    • Remote jobs are executed only when they don't affect the start times of the local jobs
    • Local jobs are given priority during backfilling
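A minimal sketch of the K-distributed submit-and-dequeue protocol described above. The `Site` class, its `load` field and the function names are hypothetical; actual local scheduling (e.g. backfilling) is not modelled, only the bookkeeping of queue copies.

```python
class Site:
    """Hypothetical site: a load figure, a wait queue and a running list."""

    def __init__(self, name, load):
        self.name = name
        self.load = load      # assumed scalar load measure (lower = lighter)
        self.queue = []       # job ids waiting at this site
        self.running = []     # job ids started at this site

    def enqueue(self, job_id):
        self.queue.append(job_id)

    def dequeue(self, job_id):
        if job_id in self.queue:
            self.queue.remove(job_id)

    def start(self, job_id):
        self.queue.remove(job_id)
        self.running.append(job_id)


def submit_k_distributed(job_id, sites, k):
    """Queue a copy of the job at the k least loaded sites."""
    chosen = sorted(sites, key=lambda s: s.load)[:k]
    for site in chosen:
        site.enqueue(job_id)
    return chosen


def notify_started(job_id, started_at, chosen):
    """On a start notification, ask the other k-1 sites to dequeue the job."""
    for site in chosen:
        if site is not started_at:
            site.dequeue(job_id)


sites = [Site("A", 0.9), Site("B", 0.2), Site("C", 0.5), Site("D", 0.7)]
chosen = submit_k_distributed("job-1", sites, k=2)   # B and C
chosen[1].start("job-1")                             # C starts the job first
notify_started("job-1", chosen[1], chosen)           # B drops its copy
print([s.name for s in chosen], [s.queue for s in sites])
```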
Results: Benefits of new schemes
[Figure not reproduced; annotations on the plots: "45% improvement", "15% improvement".]
Results: Usefulness of K-Dual scheme
• Grouping jobs submitted at lightly loaded sites and heavily loaded sites
Assessment and Enhancement of Meta-Schedulers… (Sabin et al.)
• Metascheduling working examples (LSF and Moab)
• 2 different modes:
    • Standard or centralized (all scheduling decisions are made in a centralized manner)
        • Forces local sites to accept advance reservations from the metascheduler
    • Delegated
        • Does not provide a known scheduling policy for grid jobs
Centralized
• Metascheduler queries local schedulers to obtain information regarding the current schedule
• Metascheduler makes an advance reservation on the "best" of the local schedulers
• Reservations are honored by local sites, possibly delaying local jobs
• Metascheduler tries to find better reservations for all jobs at periodic intervals
• If a better reservation is found, the metascheduler cancels the existing reservation and moves the job to another local scheduler
• This model requires close interaction between local schedulers and metaschedulers
Delegated
• Metascheduler determines the "best" site for each grid job
• Delegates scheduling responsibilities to local schedulers
• After the job is sent to the local site, there is no interaction between meta and local scheduler
• Meta scheduler "queries" the local scheduler for the metric that serves as the basis for site choice
• This model is more scalable and allows local schedulers to retain autonomy
Evaluation
System-wide average response time
• Centralized outperforms delegated since centralized revisits its scheduling decisions
Evaluation
Average response time of jobs from the least loaded site
• Metascheduling has a detrimental effect on users at the least loaded site
• At low loads, centralized is best: jobs submitted at a least loaded site may run faster at another site
• This is a case of least loaded sites getting discouraged from joining the grid!
To avoid deterioration at least loaded sites: Dues Based Queues
• Goal is to improve the priority of jobs originating from lightly loaded sites
• For each site-pair, a relative resource usage surplus/deficit is maintained
• Each site maintains the processor seconds that it has provided to other sites' jobs, and also the processor seconds that its jobs consumed at other sites
• si sets the priority for all of sj's jobs to be dues[sj]
• For lightly loaded sites, it is usually a surplus. Hence other sites will have to pay "dues" to lightly loaded sites by increasing the priorities of jobs submitted at lightly loaded sites
Dues Based Queues
• s1 runs a 100 processor-second job for s2
    • dues[s2] = -100; dues[s1] = 100
• s2 runs a 300 processor-second job for s1; s2 will be paying the "dues" to s1
    • dues[s2] = 200; dues[s1] = -200
• Queue order at each site is determined by the dues values of the submitting site
• Can be implemented in centralized
    • Dues-based queuing scheme at the meta scheduler
• Or delegated
    • Dues-based queues at the local scheduler
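The s1/s2 example can be replayed with a small bookkeeping sketch. It assumes one reading of the slide: dues[s2] is the value kept at s1 and dues[s1] is the value kept at s2. The class and function names are hypothetical, and treating sites with no history as dues 0 is an assumption, not something the slides specify.

```python
from collections import defaultdict


class Site:
    """Hypothetical site holding one dues value per other site."""

    def __init__(self, name):
        self.name = name
        self.dues = defaultdict(float)   # dues[other] = what this site owes 'other'


def record_remote_run(host, origin, processor_seconds):
    """'host' ran a job of 'origin' for the given processor seconds."""
    host.dues[origin.name] -= processor_seconds    # host owes origin less
    origin.dues[host.name] += processor_seconds    # origin owes host more


def order_queue(site, queued_jobs):
    """Order (job_id, origin_name) pairs by the dues of the submitting site.

    Higher dues first; sorted() is stable, so ties keep arrival order.
    """
    return sorted(queued_jobs, key=lambda job: site.dues[job[1]], reverse=True)


s1, s2 = Site("s1"), Site("s2")
record_remote_run(s1, s2, 100)    # s1 runs a 100 processor-second job for s2
print(s1.dues["s2"], s2.dues["s1"])   # -100.0 100.0
record_remote_run(s2, s1, 300)    # s2 runs a 300 processor-second job for s1
print(s1.dues["s2"], s2.dues["s1"])   # 200.0 -200.0
print(order_queue(s1, [("j1", "s3"), ("j2", "s2")]))
```

After the second event s1 owes s2, so at s1 a job from s2 moves ahead of a job from a site with no history.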
Evaluation
System-wide average response time
• Dues-based schemes perform worse than their corresponding base schemes
Evaluation
Average response time of jobs from the least loaded site
• Centralized dues performs the best
Another method: Local Priority with Job Sharing
• Dual queue
    • Dual queue at local schedulers
    • Local jobs will have higher priority than remote jobs
• Dual queue with local copy
    • In the dual queue model, remote jobs may suffer starvation
    • Jobs from a lightly loaded site sent to a remote site may suffer
    • In this scheme, all jobs have a copy sent to the originating site's scheduler in addition to one remote site
Evaluation
System-wide average response time
• Dual queue with local copy performs the best
Evaluation
Average response times of jobs from the least loaded site
• Dual queue with local copy performs as well as the no-sharing scheme
Summary
References
• Casavant, T. L. and Kuhl, J. G. "A taxonomy of scheduling in general-purpose distributed computing systems". IEEE Transactions on Software Engineering, Volume 14, Issue 2, pp. 141-154, February 1988.
• Hamscher, V., Schwiegelshohn, U., Streit, A. and Yahyapour, R. "Evaluation of Job-Scheduling Strategies for Grid Computing". Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Lecture Notes in Computer Science, pp. 191-202, 2000. ISBN 3-540-41403-7.
• Subramani, V., Kettimuthu, R., Srinivasan, S. and Sadayappan, P. "Distributed Job Scheduling on Computational Grids using Multiple Simultaneous Requests". Proceedings of the 11th IEEE Symposium on High Performance Distributed Computing (HPDC 2002), July 2002.
• Sabin et al. "Assessment and Enhancement of Meta-Schedulers for Multi-Site Job Scheduling". HPDC 2005.
• Vadhiyar, S., Dongarra, J. and Yarkhan, A. "GrADSolve - RPC for High Performance Computing on the Grid". Euro-Par 2003, 9th International Euro-Par Conference, Proceedings, Springer, LNCS 2790, pp. 394-403, August 26-29, 2003.
• Vadhiyar, S. and Dongarra, J. "Metascheduler for the Grid". Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, pp. 343-351, July 2002, Edinburgh, Scotland.
• Vadhiyar, S. and Dongarra, J. "GrADSolve - A Grid-based RPC system for Parallel Computing with Application-level Scheduling". Journal of Parallel and Distributed Computing, Volume 64, pp. 774-783, 2004.
• Petitet, A., Blackford, S., Dongarra, J., Ellis, B., Fagg, G., Roche, K. and Vadhiyar, S. "Numerical Libraries and The Grid: The Grads Experiments with ScaLAPACK". Journal of High Performance Applications and Supercomputing, Vol. 15, number 4 (Winter 2001), pp. 359-374.
Coallocation in Multicluster Systems
• Processor coallocation: allowing jobs to use processors in multiple clusters simultaneously
• Jobs consist of one or more components, each of which has to be scheduled on a different cluster
• A multi-component job is scheduled across a number of different clusters equal to its number of components
Queuing Structures
• Single central scheduler with one global queue for the entire set of clusters: all clusters submit single and multi-component jobs to the global queue
• Local schedulers with only local queues at the clusters: each cluster submits single and multi-component jobs to its local queue
• A global queue for the system and local queues for the clusters: a cluster submits single-component jobs to its local queue and multi-component jobs to the global queue
Scheduling
• Scheduling multi-component jobs: WorstFit
    • Order the job components in decreasing size
    • Order the clusters according to decreasing number of idle processors
    • Traverse one-by-one through both lists, trying to fit job components on clusters
    • Leaves in each cluster as much room as possible for subsequent jobs
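One way to read the WorstFit steps above is as a pairwise walk over the two sorted lists. The sketch below follows that reading; the function name, the cluster names and the idle-processor counts are hypothetical.

```python
def worst_fit(components, idle):
    """Place job components on distinct clusters, WorstFit style.

    components: list of component sizes (processors requested).
    idle: dict mapping cluster name -> number of idle processors.
    Returns {cluster: component_size}, or None if the job does not fit.
    """
    sizes = sorted(components, reverse=True)                      # decreasing size
    clusters = sorted(idle, key=lambda c: idle[c], reverse=True)  # most idle first
    if len(sizes) > len(clusters):
        return None          # each component needs its own cluster
    placement = {}
    for size, cluster in zip(sizes, clusters):
        if size > idle[cluster]:
            return None      # the i-th largest component does not fit
        placement[cluster] = size
    return placement


idle = {"c1": 20, "c2": 32, "c3": 8}
print(worst_fit([8, 16], idle))   # {'c2': 16, 'c1': 8}
print(worst_fit([40], idle))      # None
```

In the first call the 8-processor component would fit exactly on c3, but WorstFit puts it on c1, which has more idle processors, so no cluster is filled completely.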
Scheduling
• Invoked during job departure
• A queue is enabled when the corresponding scheduler is allowed to start jobs from the queue. When a queue is enabled, the job at the head of the queue is scheduled if it fits
• When a job departs, all or some of the non-empty queues are enabled
• Enabled queues are repeatedly visited in some order
• Which non-empty queues are enabled and in what order they are visited is defined by the scheduling policy
Scheduling Policies
• GS: global scheduler policy with a single queue
• LS: each cluster has only local queues. At a job departure, in which order should the non-empty queues be enabled?
    • The local scheduler that has not scheduled jobs for the longest time gets the first chance
• For systems with both a global queue and local queues:
    • GP: global priority. Local queues are enabled only when the global queue is empty
    • LP: local priority. The global queue is enabled only when at least one local queue is empty. In which order should the local queues and the global queue be enabled?
        • The global queue is enabled first and then the local queues
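The GP and LP enabling rules can be expressed as a small decision function. This is a sketch under assumptions: queues are plain lists, the global queue is called "global", only non-empty queues are enabled, and the order among local queues (dictionary order here) is a placeholder, since the slides do not specify it for GP and LP.

```python
def enabled_queues(policy, global_queue, local_queues):
    """Return the names of the queues enabled at a job departure, in visiting order.

    policy: "GP" (global priority) or "LP" (local priority).
    global_queue: list of jobs in the global queue.
    local_queues: dict mapping cluster name -> list of jobs.
    """
    nonempty_locals = [name for name, q in local_queues.items() if q]
    if policy == "GP":
        # Local queues are enabled only when the global queue is empty.
        return ["global"] if global_queue else nonempty_locals
    if policy == "LP":
        # The global queue is enabled only when at least one local queue is empty,
        # and it is then enabled before the local queues.
        any_local_empty = any(not q for q in local_queues.values())
        if global_queue and any_local_empty:
            return ["global"] + nonempty_locals
        return nonempty_locals
    raise ValueError("unknown policy: " + policy)


print(enabled_queues("GP", ["g1"], {"c1": ["a"], "c2": []}))     # ['global']
print(enabled_queues("LP", ["g1"], {"c1": ["a"], "c2": []}))     # ['global', 'c1']
print(enabled_queues("LP", ["g1"], {"c1": ["a"], "c2": ["b"]}))  # ['c1', 'c2']
```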
Coallocation Rules
• [no] only single-component jobs are admitted. No coallocation
• [co] both single and multi-component jobs. No restriction
• [rco] restriction on the size of job components
• [fco] restriction on the size and number of job components
Testbed
• DAS system in the Netherlands: 5 clusters, one with 72 nodes, the others with 32 nodes
• Intra-cluster communication: Myrinet LAN (1200 Mbit/s)
• Inter-cluster communication: 100 Mbit/s WAN
Evaluation
• 2 applications
    • Ensflow: simulating streams and eddies in the ocean
    • Poisson: solution of the 2-D Poisson equation
• Execution times measured on DAS
Results
[Figure not reproduced in the transcript.]
Conclusions
• [co] gives the worst performance, due to the simultaneous presence of large single-component jobs and jobs with many components
• [rco] and [fco] improve performance
• LS and LP provide the best results for the coallocation cases
• Performance of GS is better when there are only single-component jobs
Conclusions
• Processor co-allocation is beneficial at least when the overhead due to wide-area communication is not high
• Restrictions on the job component sizes and on the number of job components improve the performance of coallocation
Reference
• Bucur and Epema. "Scheduling Policies for Processor Coallocation in MultiCluster Systems". TPDS, July 2007.
Grid Application Development Software (GrADS) Architecture
[Architecture diagram. Components: User, Grid Routine / Application Manager, Resource Selector, Performance Modeler, MDS, NWS. Edge labels: "Matrix size, block size"; "Resource characteristics, problem characteristics"; "Final schedule – subset of resources".]
Performance Modeler
[Diagram. Components: Grid Routine / Application Manager, Performance Modeler, Scheduling Heuristic, Simulation Model. Edge labels: "All resources, problem parameters"; "Final schedule – subset of resources"; "Candidate resources"; "Execution cost".]
• The scheduling heuristic passed only those candidate schedules that had "sufficient" memory
• This is determined by calling a function in the simulation model
Simulation Model
• Simulation of the ScaLAPACK right-looking LU factorization
• More about the application
    • Iterative: each iteration corresponds to a block
    • Parallel application in which columns are block-cyclic distributed
    • Right-looking LU: based on Gaussian elimination
Operations
• The LU application in each iteration involves:
    • Block factorization: (ib:n, ib:ib) floating point operations
    • Broadcast for multiply: message size equals approximately n*block_size
    • Each process does its own multiply:
        • Remaining columns divided by number of processors
Back to the simulation model
double getExecTimeCost(int matrix_size, int block_size, candidate_schedule) {
    for (i = 0; i < number_of_blocks; i++) {
        /* Find the proc. owning the column. Note its speed and its connections to other procs. */
        tfact += … ;  /* simulate block factorization; depends on {processor_speed, machine_load, flop_count of factorization} */
        tbcast += max(bcast times for each proc.);  /* ScaLAPACK follows split-ring broadcast; simulate the broadcast algorithm for each proc.; depends on {elements of matrix to be broadcast, connection bandwidth and latency} */
        tupdate += max(matrix multiplies across all proc.);  /* depends on {flop count of matrix multiply, processor speed, load} */
    }
    return (tfact + tbcast + tupdate);
}
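The pseudocode above can be turned into a runnable toy model. This is only a sketch under loudly stated assumptions, not the GrADS simulation model: the flop counts (rows × width² for the panel, 2 × rows × width × columns for the update) are rough illustrative estimates, the split-ring broadcast is replaced by a simple latency + size/bandwidth cost per receiver, and the schedule fields `flops_per_sec`, `bandwidth` and `latency` are hypothetical names.

```python
import math


def get_exec_time_cost(matrix_size, block_size, schedule):
    """Toy estimate of LU execution time for a candidate schedule.

    schedule: list of dicts, one per processor, with hypothetical keys
      'flops_per_sec' (effective speed, already discounted for load),
      'bandwidth' (bytes/s) and 'latency' (s) for receiving a panel.
    """
    n, nb, p = matrix_size, block_size, len(schedule)
    number_of_blocks = math.ceil(n / nb)
    tfact = tbcast = tupdate = 0.0
    for i in range(number_of_blocks):
        rows = n - i * nb              # rows of the current panel
        width = min(nb, rows)          # columns of the current panel
        cols_left = rows - width       # trailing columns still to update
        owner = schedule[i % p]        # assumed 1-D block-cyclic column owner

        # Block factorization on the owner (assumed flop count).
        tfact += rows * width * width / owner["flops_per_sec"]

        # Broadcast of the panel (8-byte elements); simplified, not split-ring.
        msg_bytes = 8 * rows * width
        if p > 1:
            tbcast += max(q["latency"] + msg_bytes / q["bandwidth"]
                          for q in schedule if q is not owner)

        # Update: remaining columns divided by the number of processors;
        # the slowest processor determines the time of the step.
        my_cols = cols_left / p
        tupdate += max(2.0 * rows * width * my_cols / q["flops_per_sec"]
                       for q in schedule)
    return tfact + tbcast + tupdate


fast = [{"flops_per_sec": 2e9, "bandwidth": 1e8, "latency": 1e-4} for _ in range(4)]
slow = [{"flops_per_sec": 5e8, "bandwidth": 1e7, "latency": 1e-3} for _ in range(4)]
print(get_exec_time_cost(2000, 100, fast))
print(get_exec_time_cost(2000, 100, slow))
```

A scheduling heuristic would call such a function once per candidate schedule and keep the schedule with the lowest estimated cost.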
Initial GrADS Architecture
[Architecture diagram. Components: User, Grid Routine / Application Manager, Resource Selector, Performance Modeler, App Launcher, Contract Monitor, Application, MDS, NWS. Edge labels: "Matrix size, block size"; "Resource characteristics, problem characteristics"; "Problem, parameters, app. location, final schedule".]
Performance Model Evaluation
[Figure not reproduced in the transcript.]
GrADS Benefits
[Chart comparing the MSC cluster with the combined MSC & TORC clusters; machine-count annotations on the chart include "8 mscs, 7 torcs" and "8 mscs, 8 torcs".]
• Even though performance worsened when using multiple clusters, larger problem sizes can be solved without incurring costly disk accesses