improving resource-efficiency in cloud computing · in cloud computing christina delimitrou and...

Improving Resource-Efficiency in Cloud Computing Christina Delimitrou and Christos Kozyrakis Stanford University Unpredictability in resource allocation overprovisioning Analytical framework based on sampling Strict guarantees on predictability of resource allocation Accounts for app preferences, load changes, etc. Simplifies user billing: based on used resources + strictness of QoS constraints (function only of sampled resources) Greedy selection Analytical Framework Improving Performance Predictability Insight: Move the responsibility of determining the right amount/type of resources to the cluster manager Key ideas: 1. User specifies a performance target 2. Leverage the system’s existing knowledge to minimize overheads app classification 3. Account for allocation and assignment jointly Profiling: sparse input signal to classification Parallel classifications: translate performance to scale-up, scale-out, heterogeneity and interference similarities between old and new workloads Greedy selection: select the min amount of resources that satisfy the performance target Total overheads: a few sec (up to 2 min for apps with long setup times) Quasar monitors performance and adjusts allocation at runtime Execution in isolation Scale-out or scale-up Complete rescheduling Quasar: Resource-Efficient Cluster Management □ Datacenters greatly underutilized (25-35% for Google, 6-8% for Amazon EC2) prevents scalability, increases cost Overprovisioned reservations Workload interference Heterogeneous platforms Insufficient info on apps Utilization on a large production cluster at Twitter: Collected over 1 month Running over the Mesos scheduling system Serving latency-critical, user-facing workloads Is underutilization the cluster manager’s fault? Conservative users: Reservations exceed usage by 5-10x The cluster appears to be running close to capacity Determining the right amount/type of resources is hard! Datacenter Underutilization On-going/Future Work Gain Google 80% of servers < 20% utilization Gain PDF 0.1 0.95 x P(x) Quasar QoS target Profiling runs … A. Two stateful latency-critical apps + best-effort workloads B. General cloud provider case (1200 apps, 200 EC2 servers) Quasar Evaluation Quasar Twitter Cores Performance Scale-up Cores Performance Heterogeneity Servers Performance Scale-out Input size Performance Input load Cores Performance Interference

Upload: others

Post on 08-Aug-2020

3 views

Category:

Documents

0 download

Report

Download

Embed Size (px):

TRANSCRIPT

Improving Resource-Efficiency in Cloud Computing

Christina Delimitrou and Christos Kozyrakis Stanford University

Unpredictability in resource allocation overprovisioning

Analytical framework based on sampling Strict guarantees on predictability of resource allocation

Accounts for app preferences, load changes, etc.

Simplifies user billing: based on used resources + strictness of QoS constraints (function only of sampled resources)

Greedy selection Analytical Framework

Improving Performance Predictability

Insight: Move the responsibility of determining the right amount/type of resources to the cluster manager

Key ideas:

1. User specifies a performance target

2. Leverage the system’s existing knowledge to minimize overheads app classification

3. Account for allocation and assignment jointly

Profiling: sparse input signal

to classification

Parallel classifications: translate

performance to scale-up,

scale-out, heterogeneity and

interference similarities

between old and new workloads

Greedy selection: select the min

amount of resources that

satisfy the performance target

Total overheads: a few sec (up to 2 min for apps with long setup times)

Quasar monitors performance and adjusts allocation at runtime Execution in isolation

Scale-out or scale-up

Complete rescheduling

Quasar: Resource-Efficient Cluster Management

□ Datacenters greatly underutilized (25-35% for Google,

6-8% for Amazon EC2) prevents scalability, increases cost

Overprovisioned reservations

Workload interference

Heterogeneous platforms

Insufficient info on apps

Utilization on a large production cluster at Twitter:

Collected over 1 month

Running over the Mesos scheduling system

Serving latency-critical, user-facing workloads

Is underutilization the cluster manager’s fault?

Conservative users: Reservations exceed usage by 5-10x

The cluster appears to be running close to capacity

Determining the right amount/type of resources is hard!

Datacenter Underutilization

On-going/Future Work

Gain

Google

80% of servers

< 20% utilization

Gain PDF

0.1 0.95 x

P(x)

Quasar QoS target

Profiling

runs

…

A. Two stateful latency-critical apps + best-effort workloads

B. General cloud provider case (1200 apps, 200 EC2 servers)

Quasar Evaluation

Quasar

Twitter

Cores

Perf

orm

anc

e Scale-up

Cores

Perf

orm

anc

e Heterogeneity

Servers

Perf

orm

anc

e Scale-out

Input size

Perf

orm

anc

e Input load

Cores

Perf

orm

anc

e Interference

An Open-Source Benchmark Suite for Microservices …delimitrou/papers/2019...An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud & Edge

DRAF: A Low-Power DRAM-based Reconfigurable Acceleration ...delimitrou/slides/2016.isca.draf.slides.p… · DRAF in a Nutshell ! A high-density & low-power FPGA o Bit-level reconfigurable,

X-Containers: Breaking Down Barriers to Improve ...delimitrou/slides/... · Linux Kernel namespaces cgroups SELinux Container Shared kernel attack s s surface and TCB Not allowed

Overprov a tool for cluster overprovisioning detection

IBM FlashSystem 720 and IBM FlashSystem 820 · z FlashSystem 820 can scale up to 24 TB ... z FlashSystem storage systems use overprovisioning and advanced ... IBM FlashSystem 720

qSim: Scalable and Validated Simulation of Cloud MicroservicesMicroservices Yanqi Zhang, Yu Gan, and Christina Delimitrou Electrical and Computer Engineering Department Cornell University

Hybrid SDN Architecture for Resource Consolidation in MPLS ...pure.au.dk/portal/files/112681554/AKatov_WTS_2015_2_26.pdf · large network, with a substantial resource overprovisioning

Mage: Online and Interference-Aware Scheduling for Multi-Scale ...delimitrou/papers/2018.pact.mage.pdf · or by partitioning resources to improve isolation [20,31,32, 39]. For example,

Seer: Leveraging Big Data to Navigate The Increasing Complexity … · 2019-12-18 · Yu Gan, MeghnaPancholi, DailunCheng, SiyuanHu, Yuan He and Christina Delimitrou Cornell University

Unveiling the Hardware and Software Implications of ...delimitrou/papers/2020.top...Java and Javascript. The back-end databases use memcachedandMongoDBinstances. IoTSwarmCoordination:Finally,weexplorean

Exascale Computing - hlrs.de · Exascale Computing 4 Exascale Computing The shift from petascale computing to exascale computing—a thousandfold increase in computing power—constitutes

11-2-2014 Challenge the future Delft University of Technology Overprovisioning for Performance Consistency in Grids Nezih Yigitbasi and Dick Epema Parallel

Time and Cost-Efficient Modeling and Generation of …delimitrou/papers/TPCTC2011.pdf · Time and Cost-Efficient Modeling and Generation of Large ... for a significant portion of

Quantifying Overprovisioning vs. Class-of-Service: Informing the Net Neutrality Debate Murat Yuksel (University of Nevada – Reno) [email protected] K

Be your own chief cloud officer | Dell...• Achieve hyper-fast scale when needed using a public-cloud infrastructure. • Minimize costs by avoiding overprovisioning. The new solution

Machine Learning for Computer Systemsiacoma.cs.uiuc.edu/mcat/ml.pdfRecommender Systems Delimitrou and Kozyrakis ASPLOS 2013,2014 13 Performance Power configuration: processor type

Ripple: A Practical Declarative Programming Framework for ...delimitrou/papers/2020.arxiv.ripple.pdf · App converted to serverless pipeline Figure 1: Overview of Ripple’s operation,

IOT / BLOCKCHAIN / EDGE COMPUTING / COGNITIVE COMPUTING ... · INTERNET OF THINGS EDGE COMPUTING BLOCKCHAIN QUANTUM COMPUTING IOT / BLOCKCHAIN / EDGE COMPUTING / COGNITIVE COMPUTING

Seer: Leveraging Big Data to Navigate the Complexity of ...delimitrou/papers/2019.asplos.seer.pdf · individual microservice is much stricter than for traditional cloud applications

Practical Resource Management in Power-Constrained, High ...dkl/Publications/Papers/hpdc15.pdfderives job-level power bounds in a fair-share manner and supports overprovisioning and

X-Containers: Breaking Down Barriers to Improve ...delimitrou/papers/2019.asplos.xcontainer.pdf · container since it is shared by other containers. There have been several proposals

Overprovisioning for Performance Consistency in Grids

Computing Resource Paradigms CS3353. Computing Resource Paradigms Centralized Computing Distributed Computing

Intel® Rack Scale Design Architecture White Paper · Higher efficiency due to better resource utilization, reduced overprovisioning and dynamic workload tuning, all leading to lower

AN OPEN-SOURCE BENCHMARK S MICROSERVICES AND THEIR … › ~delimitrou › slides › ... · AN OPEN-SOURCE BENCHMARK SUITE FOR MICROSERVICES AND THEIR HARDWARE-SOFTWARE IMPLICATIONS

Cloud Computing Cloud Computing Overview of Distributed Computing

Optimizing Resource Provisioning in Shared Cloud …delimitrou/papers/2014.techreport.hybrid.pdfOptimizing Resource Provisioning in Shared Cloud Systems Abstract Cloud computing promises

Leveraging Approximation to Improve IMPROVING RESOURCE ...delimitrou/slides/2017.wax.pliant.slides.… · Leveraging Approximation to Improve Resource Efficiency in the Cloud. 2 Datacenter

AN OPEN-SOURCE BENCHMARK S MICROSERVICES AND …delimitrou/slides/2019.asplos.microservices.slides.pdfServerless frameworks •Compared long-running microservices on EC2 with short–

Andrzej Jakowski, Armoun Forghan Apr 2017 Santa Clara, CA...Cost reduction by less overprovisioning Improved deployment with more predictable & less varied latencies Cephcluster performance:

qSim: Enabling Accurate and Scalable Simulation for ...delimitrou/papers/2019.ispass.qsim.pdf · qSim is an event-driven simulator, and uses a ... evaluate two use cases for interactive

Cisco Cloud and Managed Services for Data Center ......Cloud and Managed Services (CMS) proactively monitors the overprovisioning levels and keeps track of the allocation and utilization

Uncovering the Security Implications of Cloud Multi ...delimitrou/papers/2018.toppicks.bolt.pdf · IEEE MICRO resource efficiency, both private and public clouds employ multi-tenancy,

PARTIES: QoS-Aware Resource Partitioning for Multiple ...delimitrou/papers/2019.asplos.parties.pdf · each co-scheduled workload, and maximizes throughput for the machine. We evaluate

An Open-Source Benchmark Suite for Microservices and Their ... - Cornell …delimitrou/papers/2019.asplos.micro... · 2019-02-05 · An Open-Source Benchmark Suite for Microservices