scheduling in distributed systems - andrii vozniuk

Scheduling In Distributed Systems

Andrii Vozniuk EPFL July 4, 2012

Candidacy exam

http://personnes.epfl.ch/210961



http://www.epfl.ch/

Data explosion Processing gets more complicated

Big Data

2

Resources of many computers should be used

Generates: 40 TB/dayStores: 20 PB/year

Generates: 25 TB/dayStores: 10 PB/year

Typical Data Processing Pipeline

3

Log data

Clean data

Query data

Analyze data

Sensor data

No one-size-fits-all system currently exists

ETL-like batchprocessing

Efficient queryexecution

Using resources ofmany organizations

Particle found!

User model

Outline

4

Condor - compute-intensive system

Gamma - parallel database

MapReduce - data-intensive system

Ɣ

Conclusions

Future Research

Scheduling Policy: setting an ordering of tasks Assigning resources to tasks

Scheduling In Distributed Systems

5

Scheduling is challenging in distributed systems

How to match resources and tasks?

tasktask

task task

Matching Tasks With Resources

6

How scheduling is influenced by data and execution models?

Perspectives Data model Execution model

System/Perspective

Data model Execution model

Gamma Relational Multioperator

MapReduce Unconstrained MapReduce

Condor Unconstrained Unconstrained

Gamma Pioneering parallel database Data model: constrained

Relational data model Relations are horizontally partitioned

Execution model: constrained Multioperator queries Operators employ hash-based algorithms

7

Ɣ

Gamma: Scheduler

8

Scheduling is done at the operator level

Host Machine

Query Manager Catalog

Gamma Database

a-m n-z

OperatorProcess

Operator Process

SchedulerProcess

Optimizes queryCompiles plan

SELECT r FROM RWHERE r < ‘k’

Schedulesoperators

Execution onrelevant nodes

Node 1 Node 2

Ɣ

query

Gamma: Batch Scheduling

9

Exploit sharing by scheduling in a batch Example of selection sharing

Batch scheduling trades latency for throughput

σ1

A

Shared scanσ2

A

σ1

A

σ2

Ɣ

Reads of A can be shared applying predicates in turn Shared relation A is scanned only once

Gamma: Batch Scheduling Joins

10

Sharing among joins reduces total execution time

Several hash-joins in a batch of queries Hash table for the same relation can be shared Example assumes 100% selectivity of σ

σ σ

⋈

A Β

σ σ

⋈

A C

σ σ

⋈

B A

σ

C

⋈

Shared hash-table for A

Ɣ

Sharing reduces I/O and memory usage

Limitations Of Gamma Gamma offers

Efficient query execution Sharing in a batch of queries

Gamma operates on structured data Gamma is not suitable for

Unstructured data processing ETL type of workload Running on large scale

11

A different system for ETL processing is needed

Ɣ

MapReduce System for data-intensive applications Execution model: constrained

Job is a set of map and reduce tasks Tasks are independent

Data model: unconstrained Arbitrary data format Files are partitioned into chunks Each chunk is replicated several times

12

MapReduce: Scheduling

13

4 Map tasks

2 Reduce task

MapReduce job

Map1

Map2

Map3

Map4

Reduce

Reduce

Example:

Temp1Result

1

Chunk1

Temp2

Chunk2

Temp4Temp3

Chunk3 Result

2

Chunk4

Fine grain scheduling improves fault tolerance and elasticity

Tasks are scheduled close to data Execution is scalable and fault-tolerant Execution is elastic

MapReduce: Speculative Execution Nodes may become slow Speculative execution minimizes job’s response time Launch if progress is 20% less than average

14

Speculative execution works well in homogeneous environment

Normal node

Temporary slow node

backup

straggler

Emerging Heterogeneous Infrastructures Replacement of failed components Extending existing cluster with new machines Virtualized data centers of cloud providers

CPU and RAM are isolated Contention for disk and network

15

In many real-life cases the infrastructure is heterogeneous

1 2 3 4 5 6 70

10203040506070

VMs on Physical Host

IO P

erfo

rman

ce p

er

VM (M

B/s)

MapReduce: Heterogeneous Cluster

Performance degrades on heterogeneous cluster Slow nodes are wasted Backup tasks on slow nodes All straggling tasks are treated equally Thrashing due to excessive speculative execution

16

Speculative execution should be improved for heterogeneous cluster

Fast node

Slow node

MapReduce: LATE Scheduler Idea: back up the task with the largest estimated finish

time (Longest Approximate Time to End)

Thresholds Limit the number of backup tasks Launch backup tasks on fast nodes Backup only sufficiently slow tasks

17

progress score

execution timeprogress rate =

1 – progress score

progress rateestimated time left =

LATE looks forward to prioritize tasks to speculate

MapReduce: LATE Example Back up the task with Longest Approximate Time to End

18

2 min

Time (min)

Progress = 5.3%

Estimated time left:(1-0.05) / (1/1.9) = 1.8

Progress = 66%

LATE correctly identifies task which hurts the response time the most

3x slower

1.9x slower

1

2

3

1 task/min

Estimated time left:(1-0.66) / (1/3) = 1

improvement

Limitations Of MapReduce

19

MapReduce offers High scalability Good fault tolerance Handling of unstructured data

MapReduce is not suitable for Running on multi organization infrastructure Harvesting idle resources in organization

A different system for multi organization infrastructure is needed

Compute-intensive system harvesting idle resources Data model: arbitrary Execution model: arbitrary

jobjob

Condor

20

job

job

Increase resources utilization by scheduling jobs on idle machines

How to increase utilization and respect

the owners?

jobjob

Condor Scheduler: Centralized?

21

Scheduler

Efficient but not reliable, possible bottleneck

job

job

jobjob

Condor Scheduler: Distributed?

22

Scheduler

Reliable but inefficient

Scheduler

Scheduler

Scheduler

job

job

jobjob

Condor Scheduler: Hybrid!

23

Hybrid approach has the best of both worlds

Scheduler

Scheduler

Matchmaker

Scheduler

Information about nodesInformation about tasks

job

job

1

1

12

3

3

4

Job Description

[MyType=“Job”TargetType = “Machine“Department=“CompSci“Requirements = (other.OpSys==LINUX && other.Disk > 10000000)Rank=Memory]

Machine Description

[MyType=“Machine“TargetType=“Job“Machine=“nostos.cs.wisc.edu“OpSys=“LINUX“Disk=3076077Requirement = (LoadAvg <= 0.3) && (KeyboardIdle > (15*60))Rank = other.Department==self.Department]

ClassAds: Describing Jobs and Resources

24

Matchmaker is suitable for heterogeneous shared clusters

Requirements should be satisfied Candidate with the highest rank is returned

Scheduling done at different levels Gamma: operator level scheduling enables sharing MR and Condor: arbitrary code => sharing is hard Condor: matchmaking gives control on job placement

Hybrid approaches are promising for big data processing

Scheduling in heterogeneous deployments is challenging

Conclusions

25

26

Thank you for your attention!

Feedback & Question?

[email protected]

References Matchmaking: Distributed Resource Management for Hig

h Throughput Computing by Rajesh Raman, Miron Livny and Marvin Solomon.

Batch Scheduling in Parallel Database Systems by Manish Mehta, Valery Soloviev and David J. DeWitt.

Improving MapReduce performance in heterogeneous environments by Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz and Ion Stoica

Slides 14 and 18 exploit presentation ideas from the LATE slides for OSDI 2008 by Matei Zaharia

27

http://research.cs.wisc.edu/condor/doc/hpdc98.pdf

http://research.cs.wisc.edu/condor/doc/hpdc98.pdf

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=344041&tag=1

http://static.usenix.org/event/osdi08/tech/full_papers/zaharia/zaharia.pdf



scheduling in distributed systems - andrii vozniuk

Technology

tasks tasks

slow tasks

partitioned execution

map tasks

parallel database data

matching tasks

independentdata model

arbitraryexecution model