Scheduling in distributed systems - Andrii Vozniuk
DESCRIPTION
My EPFL candidacy exam presentation: http://wiki.epfl.ch/edicpublic/documents/Candidacy%20exam/vozniuk_andrii_candidacy_writeup.pdf
Here I present how schedulers work in three distributed data processing systems and their possible optimizations. I consider Gamma (a parallel database), MapReduce (a data-intensive system) and Condor (a compute-intensive system). This talk is based on the following papers:
1) Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt
2) Improving MapReduce performance in heterogeneous environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica
3) Matchmaking: Distributed Resource Management for High Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon
TRANSCRIPT
![Page 1: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/1.jpg)
Scheduling In Distributed Systems
Andrii Vozniuk EPFL July 4, 2012
Candidacy exam
![Page 2: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/2.jpg)
Data explosion
Processing gets more complicated
Big Data
2
Resources of many computers should be used
Generates: 40 TB/day; stores: 20 PB/year
Generates: 25 TB/day; stores: 10 PB/year
![Page 3: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/3.jpg)
Typical Data Processing Pipeline
3
Log data
Clean data
Query data
Analyze data
Sensor data
No one-size-fits-all system currently exists
ETL-like batch processing
Efficient query execution
Using resources of many organizations
Particle found!
User model
![Page 4: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/4.jpg)
Outline
4
Condor - compute-intensive system
Gamma - parallel database
MapReduce - data-intensive system
Ɣ
Conclusions
Future Research
![Page 5: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/5.jpg)
Scheduling
Policy: setting an ordering of tasks
Assigning resources to tasks
Scheduling In Distributed Systems
5
Scheduling is challenging in distributed systems
How to match resources and tasks?
![Page 6: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/6.jpg)
Matching Tasks With Resources
6
How is scheduling influenced by data and execution models?
Perspectives: data model, execution model

| System | Data model | Execution model |
| --- | --- | --- |
| Gamma | Relational | Multioperator |
| MapReduce | Unconstrained | MapReduce |
| Condor | Unconstrained | Unconstrained |
![Page 7: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/7.jpg)
Gamma
Pioneering parallel database
Data model: constrained
Relational data model
Relations are horizontally partitioned
Execution model: constrained
Multioperator queries
Operators employ hash-based algorithms
7
![Page 8: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/8.jpg)
Gamma: Scheduler
8
Scheduling is done at the operator level
Host Machine
Query Manager Catalog
Gamma Database
a-m n-z
Operator Process
Operator Process
Scheduler Process
Optimizes query, compiles plan
SELECT r FROM R WHERE r < 'k'
Schedules operators
Execution on relevant nodes
Node 1 Node 2
query
![Page 9: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/9.jpg)
Gamma: Batch Scheduling
9
Exploit sharing by scheduling in a batch
Example of selection sharing
Batch scheduling trades latency for throughput
[Figure: two plans for selections σ1 and σ2 over relation A: separate scans of A versus one shared scan of A]
Reads of A can be shared by applying the predicates in turn
Shared relation A is scanned only once
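The shared scan described on this slide can be sketched in Python as below. This is only an illustration: the `shared_scan` function, the dict-based rows, and the two sample predicates are hypothetical and not Gamma's actual operator code.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, str]
Predicate = Callable[[Row], bool]


def shared_scan(relation: Iterable[Row],
                predicates: Dict[str, Predicate]) -> Dict[str, List[Row]]:
    """Read the relation once and apply every query's predicate to each row in turn."""
    results: Dict[str, List[Row]] = {query_id: [] for query_id in predicates}
    for row in relation:  # single pass over relation A
        for query_id, keep in predicates.items():
            if keep(row):
                results[query_id].append(row)
    return results


# Hypothetical relation A and two selections from a batch of queries.
A = [{"r": "a"}, {"r": "k"}, {"r": "m"}, {"r": "z"}]
out = shared_scan(iter(A), {  # iter() makes the input readable only once
    "sigma1": lambda row: row["r"] < "k",
    "sigma2": lambda row: row["r"] >= "m",
})
print(out)
```

Passing an iterator rather than a list makes the point of the slide visible: the input can be consumed only once, yet both selections still get their results.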
![Page 10: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/10.jpg)
Gamma: Batch Scheduling Joins
10
Sharing among joins reduces total execution time
Several hash-joins in a batch of queries
Hash table for the same relation can be shared
Example assumes 100% selectivity of σ
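[Figure: a batch of queries joining A with B and A with C, with selections σ on the inputs; without sharing each join builds its own hash table, with sharing a single hash table for A serves both joins]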
Sharing reduces I/O and memory usage
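A minimal Python sketch of sharing one hash table across a batch of hash joins follows. Relations are modelled as lists of dicts; `build_hash_table`, `probe`, and the sample rows are hypothetical names and data chosen for illustration, not Gamma's implementation.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def build_hash_table(relation: List[Row], key: str) -> Dict[object, List[Row]]:
    """Build phase: hash every row of the relation on the join key."""
    table: Dict[object, List[Row]] = defaultdict(list)
    for row in relation:
        table[row[key]].append(row)
    return table


def probe(table: Dict[object, List[Row]], relation: List[Row], key: str) -> List[Row]:
    """Probe phase: look up each row of the other relation in the hash table."""
    joined: List[Row] = []
    for row in relation:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


A = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}]
B = [{"id": 1, "b": "p"}, {"id": 3, "b": "q"}]
C = [{"id": 2, "c": "u"}, {"id": 2, "c": "v"}]

table_a = build_hash_table(A, "id")  # A is read and hashed once for the whole batch
a_join_b = probe(table_a, B, "id")   # first query: A join B
a_join_c = probe(table_a, C, "id")   # second query: A join C reuses the same table
print(a_join_b, a_join_c)
```

Without sharing, each query would run its own build phase over A, reading A twice and holding two copies of the table in memory.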
![Page 11: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/11.jpg)
Limitations Of Gamma
Gamma offers:
Efficient query execution
Sharing in a batch of queries
Gamma operates on structured data
Gamma is not suitable for:
Unstructured data processing
ETL type of workload
Running on large scale
11
A different system for ETL processing is needed
![Page 12: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/12.jpg)
MapReduce
System for data-intensive applications
Execution model: constrained
Job is a set of map and reduce tasks
Tasks are independent
Data model: unconstrained
Arbitrary data format
Files are partitioned into chunks
Each chunk is replicated several times
12
![Page 13: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/13.jpg)
MapReduce: Scheduling
13
Example: a MapReduce job with 4 map tasks and 2 reduce tasks
[Figure: Map1 to Map4 read Chunk1 to Chunk4 and write Temp1 to Temp4; the two Reduce tasks produce Result 1 and Result 2]
Fine-grained scheduling improves fault tolerance and elasticity
Tasks are scheduled close to data
Execution is scalable and fault-tolerant
Execution is elastic
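Scheduling tasks close to their data can be illustrated with a small greedy sketch in Python. It is a simplification under stated assumptions: the `assign_map_tasks` function, the node names, and the slot counts are hypothetical, and a real MapReduce master assigns tasks as workers report free slots rather than in one batch.

```python
from typing import Dict, List, Tuple


def assign_map_tasks(chunk_replicas: Dict[str, List[str]],
                     free_slots: Dict[str, int]) -> Dict[str, Tuple[str, bool]]:
    """Map each chunk to (node, is_local), preferring nodes that hold a replica."""
    slots = dict(free_slots)  # copy so the caller's dict is not modified
    assignment: Dict[str, Tuple[str, bool]] = {}
    for chunk, replicas in chunk_replicas.items():
        local = [node for node in replicas if slots.get(node, 0) > 0]
        candidates = local or [node for node, free in slots.items() if free > 0]
        if not candidates:
            break  # no free slot anywhere; remaining tasks wait
        node = candidates[0]
        slots[node] -= 1
        assignment[chunk] = (node, node in replicas)
    return assignment


# Hypothetical cluster: each chunk is replicated on some of the nodes n1..n3.
chunk_replicas = {
    "Chunk1": ["n1", "n2"],
    "Chunk2": ["n1", "n3"],
    "Chunk3": ["n1", "n2"],
    "Chunk4": ["n3"],
}
free_slots = {"n1": 1, "n2": 3, "n3": 0}
plan = assign_map_tasks(chunk_replicas, free_slots)
print(plan)
```

In this example Chunk2 and Chunk4 fall back to a remote node because every node holding one of their replicas is busy, which is the case a locality-aware scheduler tries to keep rare.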
![Page 14: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/14.jpg)
MapReduce: Speculative Execution
Nodes may become slow
Speculative execution minimizes a job's response time
Launch a backup task if a task's progress is 20% less than average
14
Speculative execution works well in a homogeneous environment
Normal node
Temporary slow node
backup
straggler
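The straggler rule from this slide can be sketched in Python as follows. The sketch assumes that "20% less than average" means a progress score more than 0.2 below the mean score of the running tasks; that reading, the `find_stragglers` name, and the sample scores are assumptions made for illustration.

```python
from typing import Dict, List


def find_stragglers(progress: Dict[str, float], threshold: float = 0.2) -> List[str]:
    """Return tasks whose progress score is more than `threshold` below the average."""
    if not progress:
        return []
    average = sum(progress.values()) / len(progress)
    return [task for task, score in progress.items() if score < average - threshold]


# Hypothetical progress scores in [0, 1] for three running tasks.
scores = {"task1": 0.90, "task2": 0.85, "task3": 0.40}
stragglers = find_stragglers(scores)
print(stragglers)
```

Here the average is about 0.72, so only `task3` falls below the 0.52 cutoff and would get a backup copy.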
![Page 15: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/15.jpg)
Emerging Heterogeneous Infrastructures
Replacement of failed components
Extending existing cluster with new machines
Virtualized data centers of cloud providers
CPU and RAM are isolated
Contention for disk and network
15
In many real-life cases the infrastructure is heterogeneous
[Figure: I/O performance per VM (MB/s) versus number of VMs on a physical host]
![Page 16: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/16.jpg)
MapReduce: Heterogeneous Cluster
Performance degrades on a heterogeneous cluster
Slow nodes are wasted
Backup tasks on slow nodes
All straggling tasks are treated equally
Thrashing due to excessive speculative execution
16
Speculative execution should be improved for heterogeneous clusters
Fast node
Slow node
![Page 17: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/17.jpg)
MapReduce: LATE Scheduler
Idea: back up the task with the largest estimated finish time (Longest Approximate Time to End)
Thresholds:
Limit the number of backup tasks
Launch backup tasks on fast nodes
Back up only sufficiently slow tasks
17
$$\text{progress rate} = \frac{\text{progress score}}{\text{execution time}}$$
$$\text{estimated time left} = \frac{1 - \text{progress score}}{\text{progress rate}}$$
LATE looks forward to prioritize tasks to speculate
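The two formulas and the thresholds can be put together in a short Python sketch. This is an illustration under assumptions, not the LATE implementation: the `Task` class, the function names, and the absolute `slow_task_rate` cutoff are hypothetical, and the rule about launching backups only on fast nodes is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    name: str
    progress_score: float  # fraction of work done, in [0, 1]
    execution_time: float  # minutes since the task started


def progress_rate(task: Task) -> float:
    return task.progress_score / task.execution_time


def estimated_time_left(task: Task) -> float:
    return (1.0 - task.progress_score) / progress_rate(task)


def pick_task_to_back_up(tasks: List[Task], running_backups: int,
                         max_backups: int, slow_task_rate: float) -> Optional[Task]:
    """Pick the sufficiently slow task with the longest estimated time left."""
    if running_backups >= max_backups:  # limit the number of backup tasks
        return None
    candidates = [
        t for t in tasks
        if t.progress_score > 0 and t.execution_time > 0
        and progress_rate(t) < slow_task_rate  # back up only sufficiently slow tasks
    ]
    if not candidates:
        return None
    return max(candidates, key=estimated_time_left)


# Scores and rates taken from the talk's LATE example; execution times are derived from them.
tasks = [
    Task("task on 3x slower node", progress_score=0.66, execution_time=1.98),
    Task("task on 1.9x slower node", progress_score=0.05, execution_time=0.095),
]
chosen = pick_task_to_back_up(tasks, running_backups=0, max_backups=1, slow_task_rate=0.9)
print(chosen.name, round(estimated_time_left(chosen), 2))
```

The task at 66% progress has only about 1 minute left, while the task at 5% has about 1.8 minutes left, so the second one is chosen even though its node is the faster of the two slow nodes.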
![Page 18: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/18.jpg)
MapReduce: LATE Example
Back up the task with Longest Approximate Time to End
18
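[Figure: timeline in minutes, marked at 2 min, for three nodes: one running at 1 task/min, one 3x slower, one 1.9x slower]
Task on the 3x slower node: progress = 66%, estimated time left: (1 - 0.66) / (1/3) ≈ 1 min
Task on the 1.9x slower node: progress = 5.3%, estimated time left: (1 - 0.05) / (1/1.9) ≈ 1.8 min
LATE correctly identifies the task that hurts the response time the most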
![Page 19: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/19.jpg)
Limitations Of MapReduce
19
MapReduce offers:
High scalability
Good fault tolerance
Handling of unstructured data
MapReduce is not suitable for:
Running on multi-organization infrastructure
Harvesting idle resources in an organization
A different system for multi-organization infrastructure is needed
![Page 20: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/20.jpg)
Condor
Compute-intensive system harvesting idle resources
Data model: arbitrary
Execution model: arbitrary
20
Increase resource utilization by scheduling jobs on idle machines
How to increase utilization and respect the owners?
![Page 21: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/21.jpg)
Condor Scheduler: Centralized?
21
[Figure: jobs and a single central Scheduler]
Efficient but not reliable, possible bottleneck
![Page 22: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/22.jpg)
Condor Scheduler: Distributed?
22
[Figure: jobs and four separate Scheduler instances]
Reliable but inefficient
![Page 23: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/23.jpg)
Condor Scheduler: Hybrid!
23
Hybrid approach has the best of both worlds
[Figure: several Schedulers and a central Matchmaker that receives information about nodes and information about tasks; the steps are numbered 1 to 4]
![Page 24: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/24.jpg)
Job Description
[
MyType = "Job"
TargetType = "Machine"
Department = "CompSci"
Requirements = (other.OpSys == "LINUX" && other.Disk > 10000000)
Rank = Memory
]
Machine Description
[
MyType = "Machine"
TargetType = "Job"
Machine = "nostos.cs.wisc.edu"
OpSys = "LINUX"
Disk = 3076077
Requirements = (LoadAvg <= 0.3) && (KeyboardIdle > (15*60))
Rank = other.Department == self.Department
]
ClassAds: Describing Jobs and Resources
24
Matchmaker is suitable for heterogeneous shared clusters
Requirements should be satisfied
Candidate with the highest rank is returned
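The matching rule on this slide can be sketched in Python. This is not the ClassAd language or Condor's matchmaker: ads are modelled as plain dicts whose `Requirements` and `Rank` entries are functions of (self, other), ranking uses only the job's `Rank`, and the machine names and attribute values are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Ad = Dict[str, object]


def make_machine(name: str, disk: int, memory: int,
                 load_avg: float, keyboard_idle: int) -> Ad:
    """Build a hypothetical machine ad that accepts jobs only when the machine is idle."""
    return {
        "MyType": "Machine", "Machine": name, "OpSys": "LINUX",
        "Disk": disk, "Memory": memory,
        "LoadAvg": load_avg, "KeyboardIdle": keyboard_idle,
        "Requirements": lambda self, other: (
            self["LoadAvg"] <= 0.3 and self["KeyboardIdle"] > 15 * 60),
    }


def matches(job: Ad, machine: Ad) -> bool:
    """Both sides' requirements must be satisfied."""
    return bool(job["Requirements"](job, machine)
                and machine["Requirements"](machine, job))


def matchmake(job: Ad, machines: List[Ad]) -> Optional[Ad]:
    """Return the compatible machine that the job ranks highest, or None."""
    candidates = [m for m in machines if matches(job, m)]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job["Rank"](job, m))


job: Ad = {
    "MyType": "Job", "Department": "CompSci",
    "Requirements": lambda self, other: (
        other["OpSys"] == "LINUX" and other["Disk"] > 10_000_000),
    "Rank": lambda self, other: other["Memory"],  # prefer more memory
}
machines = [
    make_machine("m1", disk=20_000_000, memory=64, load_avg=0.1, keyboard_idle=1800),
    make_machine("m2", disk=50_000_000, memory=128, load_avg=0.1, keyboard_idle=1200),
    make_machine("m3", disk=3_076_077, memory=256, load_avg=0.1, keyboard_idle=1800),
    make_machine("m4", disk=20_000_000, memory=512, load_avg=0.9, keyboard_idle=1800),
]
best = matchmake(job, machines)
print(best["Machine"] if best else None)
```

Machine `m3` is rejected by the job (too little disk) and `m4` rejects the job (its owner is using it), so the job lands on `m2`, the compatible machine with the most memory.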
![Page 25: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/25.jpg)
Scheduling is done at different levels
Gamma: operator-level scheduling enables sharing
MR and Condor: arbitrary code => sharing is hard
Condor: matchmaking gives control over job placement
Hybrid approaches are promising for big data processing
Scheduling in heterogeneous deployments is challenging
Conclusions
25
![Page 27: Scheduling in distributed systems - Andrii Vozniuk](https://reader035.vdocuments.mx/reader035/viewer/2022062312/556589b4d8b42a723f8b52a9/html5/thumbnails/27.jpg)
References
Matchmaking: Distributed Resource Management for High Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon.
Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt.
Improving MapReduce performance in heterogeneous environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica
Slides 14 and 18 exploit presentation ideas from the LATE slides for OSDI 2008 by Matei Zaharia
27