The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems. Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema. Parallel and Distributed Systems Group, TU Delft. ACM/IEEE Int’l. Symposium on High Performance Distributed Computing.
The Performance of Bags-Of-Tasks in Large-Scale Distributed Computing Systems
Alexandru Iosup, Ozan Sonmez, Shanny Anoep, and Dick Epema
ACM/IEEE Int’l. Symposium on High Performance Distributed Computing
Parallel and Distributed Systems Group, TU Delft
The VL-e project
• A grid project in the Netherlands (2004-)
• Natural gas money: VL-e 45 MEuro / 800 MEuro total research package
• Overall aim: … to design and build a virtual lab for (digitally) enhanced science (e-science) experiments (no in-vivo or in-vitro, but in-silico experiments).
• Goals:
1. create prototypes of application-specific e-science environments
2. design and develop re-usable ICT/grid components
3. validate with real-life applications in testbeds
Natural gas price → $$ for grid computing
The VL-e project: application areas
• Grid Services: Harness multi-domain distributed resources
• Management of comm. & computing
• Virtual Laboratory (VL): Application Oriented Services
• Application areas: Data Intensive Science, Bio-Diversity, Bio-Informatics, Food Informatics, Medical Diagnosis & Imaging, Dutch Telescience
• Philips, Unilever, IBM
• Bags-of-Tasks
The Challenge
• Complete scientific work better, …
  • User-oriented performance metrics (time a critical performance component)
  • Bags-of-tasks for ease-of-use
• … in real systems
  • Workloads (now that real traces are available)
  • Information unavailability
• What to do?
  • Hint: the next 10% improvement won’t cut it!
The Challenge (cont’d.)
• System model: What is a good model for the study of large-scale distributed computing systems that run bags-of-tasks?
• Input model: What is a good model for bag-of-tasks workloads in large-scale distributed computing systems?
• What is the best setup for such system/input?
  • How to find the best?
  • If a best is found, can there be another?
The Performance of Bags-of-Tasks in Large-Scale Distributed Computing Systems
1. Introduction and Motivation
2. Context: System Model
3. Workload Model
4. Design Space Exploration
5. Conclusion
Context: System Model [1/4]
Overview
• System Model
1. Clusters: execute jobs
2. Resource managers: coordinate job execution
3. Resource management architectures: route jobs among resource managers
4. Task selection policies: create the eligible set
5. Task scheduling policies: schedule the eligible set
Context: System Model [2/4]
Resource Management Architectures: route jobs among resource managers
• Separated Clusters (sep-c)
• Centralized (csp)
• Decentralized (fcondor)
Context: System Model [3/4]
Task Selection Policies: create the eligible set
• Age-based:
1. S-T: Select Tasks in the order of their arrival.
2. S-BoT: Select BoTs in the order of their arrival.
• User priority based:
3. S-U-Prio: Select the tasks of the User with the highest Priority.
• Based on fairness in resource consumption:
4. S-U-T: Select the Tasks of the User with the lowest resource consumption.
5. S-U-BoT: Select the BoTs of the User with the lowest resource consumption.
6. S-U-GRR: Select the User Round-Robin/all tasks for this user.
7. S-U-RR: Select the User Round-Robin/one task for this user.
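Three of the selection policies above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the simulator's code: the `Task` record, its field names, and the function names are hypothetical, and users are visited in the order of their first arriving task.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    # Hypothetical record; the field names are assumptions for illustration.
    task_id: int
    bot_id: int
    user: str
    arrival: float      # arrival time of this task
    bot_arrival: float  # arrival time of the bag this task belongs to


def select_s_t(tasks):
    """S-T: order tasks by their own arrival time."""
    return sorted(tasks, key=lambda t: (t.arrival, t.task_id))


def select_s_bot(tasks):
    """S-BoT: order whole bags by bag arrival; inside a bag, by task arrival."""
    return sorted(tasks, key=lambda t: (t.bot_arrival, t.bot_id, t.arrival, t.task_id))


def select_s_u_rr(tasks):
    """S-U-RR: cycle over users, taking one task per user per turn."""
    queues = {}
    for t in select_s_t(tasks):
        queues.setdefault(t.user, deque()).append(t)
    order = []
    while queues:
        for user in list(queues):  # copy the keys so users can be dropped
            order.append(queues[user].popleft())
            if not queues[user]:
                del queues[user]
    return order
```

With one user submitting a three-task bag and a second user submitting a single task in between, the three functions return three different orders, which is the effect the policies are meant to have on the eligible set.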
Context: System Model [4/4]
Task Scheduling Policies: schedule the eligible set
• Information availability:
  • Known (K)
  • Unknown (U)
  • Historical records (H)
• Sample policies:
  • Earliest Completion Time (with Prediction of Runtimes) (ECT(-P))
  • Fastest Processor First (FPF)
  • (Dynamic) Fastest Processor Largest Task ((D)FPLT)
  • Shortest Task First w/ Replication (STFR)
  • Work Queue w/ Replication (WQR)
[Table: policies by Task Information (columns K, H, U) and Resource Information (rows K, H, U). Entries: ECT, FPLT; FPF; ECT-P; DFPLT, MQD; STFR; RR, WQR.]
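To make the role of information availability concrete, here is a minimal Python sketch of an Earliest Completion Time style scheduler. It assumes that ECT places each task on the processor where it would finish earliest, which requires both task runtimes and processor speeds to be known. The function name, the parameters, and the greedy one-pass structure are assumptions for illustration, not the policy as defined in the article.

```python
def schedule_ect(task_work, proc_speed):
    """Greedy ECT sketch: place each task where it would finish earliest.

    task_work:  work per task (e.g. its runtime on a speed-1.0 processor).
    proc_speed: relative speed per processor.
    Returns (assignment, makespan); assignment[i] is the processor of task i.
    """
    free_at = [0.0] * len(proc_speed)  # time at which each processor becomes idle
    assignment = []
    for work in task_work:
        # Completion time of this task on every processor, given current queues.
        finish = [free_at[p] + work / proc_speed[p] for p in range(len(proc_speed))]
        best = min(range(len(proc_speed)), key=lambda p: finish[p])
        free_at[best] = finish[best]
        assignment.append(best)
    return assignment, max(free_at)
```

When task runtimes are unknown, `work` is not available at decision time, which is why policies for the U columns have to fall back on processor speed alone or on replication.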
Workload Modeling 101: What Matters
• Job arrival process & job service time:
  • Self-similarity (burstiness) vs. Poisson [Leland & Ott ToN’94]
• Job grouping: bags-of-tasks dominant application type in multi-cluster grids and cycle-scavenging systems (the e-Science infrastructure) [IosupJSE EuroPar’07]
• Job size: almost always 1 CPU [IosupDELW Grid’06]
[Figure: No. Packets/Time Unit vs. Time Units, two panels (Time Unit = 0.01s and Time Unit = 100s); annotation: Longer queues.]
A Bag-of-Tasks Workload Model
• Model:
  • Users, Bags-of-Tasks, Tasks
  • Heavy-tailed distributions for inter-arrival time, job service time → can model self-similar workloads
• More details (e.g., parameter values): see article
• Validation data: the Grid Workloads Archive (http://gwa.ewi.tudelft.nl/)
  • 7 long-term grid traces
  • >5 million tasks
  • >2500 users
  • >40k CPUs
  • Domains: HEP, graphics, AI, math, biomed, climate, finance, aero…
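The structure of such a model (users, bags-of-tasks, tasks, heavy-tailed inter-arrival and service times) can be sketched with Python's standard library. Everything numeric below is a placeholder: the choice of a Weibull distribution with shape below 1, the scale values, and the uniform bag size are assumptions for illustration, not the fitted parameters, which are given in the article.

```python
import random


def generate_bots(n_bots, users, max_bag_size=10, seed=0):
    """Generate a synthetic bag-of-tasks workload (illustrative parameters only)."""
    rng = random.Random(seed)
    now = 0.0
    bots = []
    for bot_id in range(n_bots):
        # Weibull with shape < 1 is heavy-tailed; scale 60.0 and shape 0.5 are placeholders.
        now += rng.weibullvariate(60.0, 0.5)
        # Bag size is drawn uniformly here only for simplicity (an assumption).
        size = rng.randint(1, max_bag_size)
        # Heavy-tailed task service times, again with placeholder parameters.
        runtimes = [rng.weibullvariate(300.0, 0.7) for _ in range(size)]
        bots.append({
            "bot_id": bot_id,
            "user": rng.choice(users),
            "arrival": now,
            "runtimes": runtimes,
        })
    return bots
```

Fixing the seed makes a generated workload reproducible, so the same input can be replayed against different system configurations.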
Design Space Exploration [1/5]
Overview
• Design space exploration: time to understand how our solutions fit into the complete system.
• Study the impact of:
  • The Task Scheduling Policy (s policies)
  • The Workload Characteristics (P characteristics)
  • The Dynamic System Information (I levels)
  • The Task Selection Policy (S policies)
  • The Resource Management Architecture (A policies)
s x 7P x I x S x A x (environment) → >2M design points
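The multiplicative growth of the design space can be illustrated by enumerating a small cross-product in Python. The axis names follow the list above, but the values chosen for each axis are a hypothetical subset picked for illustration; they are not the sets used in the study, and the resulting count is far smaller than the >2M points quoted.

```python
from itertools import product
from math import prod

# Hypothetical subset of each axis; the real study uses larger sets.
axes = {
    "scheduling_policy": ["ECT", "FPF", "WQR"],
    "system_information": ["K", "H", "U"],
    "selection_policy": ["S-T", "S-BoT", "S-U-RR"],
    "architecture": ["sep-c", "csp", "fcondor"],
    "system_load": [0.20, 0.50, 0.95],
}


def design_points(axes):
    """Yield one configuration dict per point of the cross-product."""
    names = list(axes)
    for values in product(*(axes[name] for name in names)):
        yield dict(zip(names, values))


# The number of points is the product of the axis sizes.
n_points = prod(len(values) for values in axes.values())
```

Even with three values per axis the product reaches 243 configurations, which shows why an exhaustive sweep over the full space needs a simulator rather than a real testbed.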
Design Space Exploration [2/5]
Experimental Setup
• Simulator:
  • DGSim [IosupETFL SC’07, IosupSE EuroPar’08]
• System:
  • DAS + Grid’5000 [Cappello & Bal CCGrid’07]
  • >3,000 CPUs: relative perf. 1-1.75
• Metrics:
  • Makespan
  • Normalized Schedule Length ~ speed-up
• Workloads:
  • Real: DAS + Grid’5000
  • Realistic: system load 20-95% (from workload model)
Design Space Exploration [3/5] Selected Results A
Design Guidelines for Scheduling Policies
• Influence of the information type:
  • (K,K): best balance between MS and NSL
  • (*,U),(U,*): surprisingly good (FPF) to surprisingly poor (WQR4x)
  • (*,H),(H,*): poor. Simple runtime predictors don’t work (see article)
• Where to invest time?
  • K -> H, K -> U: adapt for information type with lowest variation
[Figure labels: WQR4x, FPF]
Design Space Exploration [4/5] Selected Results B
Task Selection Only for Busy Systems
• Not much difference until system load over 50%.
• For DAS + Grid’5000 no change of task selection policy.
[Figure labels: Same performance, S-BoT, S-T]
Design Space Exploration [5/5] Selected Results C
Resource Management Architecture
• Centralized, separated, or distributed?
  • Centralized is best [Note: job overhead not considered.]
  • Distributed: good for system load below 50%; over 50% it does not finish all tasks.
Conclusion
• System Model = Resource Management Architecture + Task Selection Policy + Task Scheduling Policy
• Information availability framework
• BoT workload model
• Design space exploration: the performance of bags-of-tasks
Future Work?
• Better predictors
• (H,H) task scheduling policies
Thank you! Questions? Remarks? Observations?
Help building the Grid Workloads Archive:
http://gwa.ewi.tudelft.nl
• Contact: [email protected] [google “Iosup”]
• Web sites:
o http://www.vl-e.nl : VL-e project
o http://www.pds.ewi.tudelft.nl : PDS group articles & software