Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing
72150263 심윤석
TRANSCRIPT
INDEX
• Background
• Goals & Challenges of Apollo
• Apollo Framework
• Evaluation
• Conclusion
BACKGROUND
• Job → Stage → Task hierarchy
• SCOPE compiles a job into a DAG (directed acyclic graph) of stages
BACKGROUND
GOALS & CHALLENGES
• Minimize Job Latency & Maximize Cluster Utilization
• Challenges:
• Scaling
• Heterogeneous workload
• Maximize Resource Utilization
GOALS & CHALLENGES
• Scale:
• Jobs process GB to PB of data
• 100,000 scheduling request/sec (in peak time)
• Clusters contain over 20,000 servers
• Clusters run up to 170,000 tasks in parallel
GOALS & CHALLENGES
• Heterogeneous workload:
• Short (Seconds) & Long (Hours) Execution Times
• I/O bound, CPU bound
• Various Resource Requirements (e.g. Memory, Cores)
• Data Locality matters for Long Tasks; Scheduling Latency matters for Short Tasks
GOALS & CHALLENGES
• Maximize Utilization:
• Workload Fluctuates Regularly
• Especially CPU Utilization
APOLLO FRAMEWORK
APOLLO FRAMEWORK
Distributed and Coordinated Scheduler
APOLLO FRAMEWORK
Estimation-Based Scheduling
APOLLO FRAMEWORK
Wait-Time Update
APOLLO FRAMEWORK
• Wait-Time Matrix:
• Represents server load
• Lightweight
• Expected Wait Time
• Future Resource Availability
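The wait-time matrix idea above can be sketched in a few lines of Python. This is an illustrative model only (class and parameter names are mine, not the paper's): each server estimates, from its currently running tasks, how long a request of a given size would have to wait.

```python
class WaitTimeMatrix:
    """Per-server sketch: estimate wait time for a (cores, memory) request.
    Illustrative names; the paper's matrix is indexed by resource quanta."""

    def __init__(self, total_cores, total_mem_gb, running=None):
        self.total_cores = total_cores
        self.total_mem_gb = total_mem_gb
        # Tasks currently on this server: (cores, mem_gb, seconds_until_done).
        self.running = list(running or [])

    def expected_wait(self, cores, mem_gb):
        """Seconds until a request of (cores, mem_gb) could start here."""
        free_c = self.total_cores - sum(c for c, _, _ in self.running)
        free_m = self.total_mem_gb - sum(m for _, m, _ in self.running)
        if free_c >= cores and free_m >= mem_gb:
            return 0.0  # fits immediately
        # Free resources as running tasks finish, earliest first.
        for c, m, done_in in sorted(self.running, key=lambda t: t[2]):
            free_c += c
            free_m += m
            if free_c >= cores and free_m >= mem_gb:
                return float(done_in)
        return float("inf")  # request exceeds the server's total capacity
```

Publishing only this small table per server (rather than full queue contents) is what keeps the load information lightweight while still exposing future resource availability.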
APOLLO FRAMEWORK
• Estimation-Based Scheduling:
• Minimizes Task Completion Time
• Stable matching algorithm
• Task Completion Time Equation
• Estimated Task Completion Time: E = I + W + R
  (E: estimated task completion time, I: initialization time, W: wait time, R: runtime)
• Including Server Failure Cost: C = P_succ × E + K × (1 − P_succ) × E
  (C: final estimated completion time, P_succ: success probability, K: server failure penalty)
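The two estimation formulas above translate directly into code. A minimal sketch (function and parameter names are illustrative, not from the paper) of computing C and picking the server that minimizes it:

```python
def estimated_completion(init_s, wait_s, runtime_s, p_succ, penalty_k):
    """E = I + W + R, then C = P_succ * E + K * (1 - P_succ) * E.
    A failed attempt is modeled as costing K times the estimate E."""
    e = init_s + wait_s + runtime_s
    return p_succ * e + penalty_k * (1.0 - p_succ) * e

def best_server(task_estimates):
    """Pick the server with the smallest final estimate C.
    `task_estimates` maps server name -> (I, W, R, P_succ, K)."""
    return min(task_estimates, key=lambda s: estimated_completion(*task_estimates[s]))
```

Note how a reliable server (P_succ close to 1) is scored by E alone, while an unreliable one is penalized in proportion to (1 − P_succ).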
APOLLO FRAMEWORK
• Distributed and Coordinated Scheduler:
• One scheduler per job
• Each scheduler makes independent decisions based on global status
• Conflicts can occur
APOLLO FRAMEWORK
• Correcting Conflicts (Correction Mechanism):
• Re-evaluates prior scheduling decisions
• Duplicate scheduling, guided by confidence in the estimates
• Scattering completion times via randomization
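One way to picture the re-evaluation step: once fresh load information arrives, a scheduler compares the current placement against the best alternative and issues a duplicate only if the gap is large. This sketch is my own simplification (the ratio threshold and names are illustrative, not the paper's actual policy):

```python
def maybe_duplicate(estimate_on, current_server, servers, ratio=1.5):
    """Re-evaluate a prior decision with fresh estimates.

    estimate_on: server -> current estimated completion time C for the task.
    Returns a server to start a duplicate on, or None to keep the original.
    """
    best = min(servers, key=estimate_on)
    # Only correct when the original now looks substantially worse;
    # the ratio guard keeps the correction (duplicate) trigger rate low.
    if best != current_server and estimate_on(current_server) > ratio * estimate_on(best):
        return best
    return None
```

Triggering only on large estimate gaps matches the evaluation numbers later in the talk: corrections fire rarely (< 0.5% of tasks) but usually pay off.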
APOLLO FRAMEWORK
• Opportunistic Scheduling:
• Maximizes utilization
• Randomized scheduling for fairness
• Opportunistic tasks:
• Can be preempted
• Can be upgraded to regular tasks
• Only consume idle resources
• Opportunistic tasks may use resources only when no regular task needs them
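The slot-filling behavior described above can be sketched as a tiny dispatch loop. This is an illustrative model (class shape and names are mine): opportunistic tasks fill idle slots and are preempted, not lost, when regular work arrives.

```python
from collections import deque

class Server:
    """Sketch of opportunistic scheduling on one server: opportunistic
    tasks only consume idle slots and yield them to regular tasks."""

    def __init__(self, slots):
        self.slots = slots
        self.regular = deque()        # regular tasks waiting to run
        self.opportunistic = deque()  # opportunistic tasks waiting to run
        self.running = []             # (task, is_regular)

    def dispatch(self):
        # Preempt opportunistic tasks while regular work is waiting.
        while self.regular and len(self.running) >= self.slots:
            for i, (task, is_regular) in enumerate(self.running):
                if not is_regular:
                    self.running.pop(i)
                    self.opportunistic.appendleft(task)  # requeued, not lost
                    break
            else:
                break  # every slot holds a regular task; nothing to preempt
        # Fill free slots: regular tasks first, then opportunistic ones.
        while len(self.running) < self.slots and self.regular:
            self.running.append((self.regular.popleft(), True))
        while len(self.running) < self.slots and self.opportunistic:
            self.running.append((self.opportunistic.popleft(), False))
```

Because opportunistic tasks never block regular ones, idle capacity can be soaked up without hurting the latency of regular work.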
EVALUATION
• Apollo at Scale
• Scheduling Quality
• Evaluating Estimated Completion Times
• Correction Effectiveness
• Stable matching Efficiency
EVALUATION
• Apollo at Scale:
• Runs 170,000 tasks in parallel
• Tracks 14,000,000 pending tasks
• Well utilized on weekdays (90% median CPU utilization)
EVALUATION
• Scheduling Quality:
• 80% of recurring jobs get faster
• Significantly improved wait times
• Performance close to an oracle scheduler (no scheduling latency, conflicts, failures, …)
EVALUATION
• Evaluating Estimated Completion Times
EVALUATION
• Correction Effectiveness
• 82% Success rate
• < 0.5% Trigger rate
• Stable matching Efficiency
CONCLUSION
• Minimize Job Latency:
• Loosely Coordinated Distributed Scheduler
• High-Quality Scheduling
• Maximize Cluster Utilization:
• Opportunistic Scheduling
REFERENCE
• https://www.usenix.org/conference/osdi14/technical-sessions/presentation/boutin
• https://www.usenix.org/sites/default/files/conference/protected-files/osdi14_slides_boutin.pdf