Heracles: Improving Resource Efficiency at Scale

ISCA’15

Stanford University, Google, Inc.

Outline

Introduction

Design

◦ Isolation Mechanisms

◦ Controllers

Evaluation

Conclusion

Motivation

Average server utilization in most datacenters is low, ranging from 10% to 50%.

◦ It is difficult to consolidate latency-critical services onto a subset of highly utilized servers.

Server utilization can be increased by launching best-effort tasks on the same server as a latency-critical job.

Motivation (Cont.)

Previous work tends to protect LC workloads, but at the cost of reducing the opportunities for higher utilization through co-location.

Goal

Eliminate SLO violations at all levels of load for the LC job while maximizing the throughput of BE tasks.

Heracles

A real-time, feedback-based controller

◦ Enables the safe co-location of best-effort (BE) tasks alongside a latency-critical (LC) service.

◦ Ensures that LC jobs meet their latency target while maximizing the resources given to BE tasks.

Heracles (Cont.)

◦ Four hardware and software isolation mechanisms.

Hardware: shared cache partitioning, fine-grained power/frequency setting.

Software: core isolation, network traffic control.

Isolation Mechanisms (Soft)

Core isolation

◦ Pin each workload to a set of cores using cpuset cgroups.

◦ Speed of (re)allocation: tens of milliseconds.
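As a rough illustration of core isolation, the sketch below splits a server's cores into an LC set and a BE set and formats each as a cpuset-style core list, the kind of string a cgroup's `cpuset.cpus` file accepts. The function names and the choice to give BE tasks the highest-numbered cores are assumptions made for this example, not details taken from the slides.

```python
def cpuset_string(cores):
    """Format core IDs as a cpuset list, collapsing runs into ranges."""
    cores = sorted(set(cores))
    ranges = []
    i = 0
    while i < len(cores):
        start = cores[i]
        # Extend the current run while the next core ID is consecutive.
        while i + 1 < len(cores) and cores[i + 1] == cores[i] + 1:
            i += 1
        end = cores[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


def split_cores(total_cores, be_cores):
    """Return (lc_cpuset, be_cpuset) with no core shared between the two.

    Assumption for this sketch: BE tasks get the highest-numbered cores.
    """
    if not 0 <= be_cores < total_cores:
        raise ValueError("the LC job must keep at least one core")
    boundary = total_cores - be_cores
    lc = range(0, boundary)
    be = range(boundary, total_cores)
    return cpuset_string(lc), cpuset_string(be)


lc_set, be_set = split_cores(total_cores=16, be_cores=4)
print("LC cores:", lc_set)
print("BE cores:", be_set)
```

Reallocating cores then amounts to recomputing the two strings with a different `be_cores` value and rewriting both cpusets.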

Network traffic

◦ Limit the outgoing bandwidth of BE tasks using Linux traffic control.

◦ No limit is placed on the LC job.

◦ Takes effect within hundreds of milliseconds.
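One way to picture the network limit is as a budget: BE tasks may use whatever egress bandwidth the LC job leaves free, minus some safety headroom. The sketch below computes such a cap. The function name, the units, and the 5% headroom fraction are hypothetical choices for illustration; the slides only state that BE egress traffic is capped with Linux traffic control and the LC job is not.

```python
def be_egress_limit(link_rate_gbps, lc_egress_gbps, headroom_fraction=0.05):
    """Return an egress cap (Gbps) for BE tasks.

    headroom_fraction is a hypothetical safety margin, expressed as a
    fraction of the link rate, reserved for bursts of LC traffic.
    """
    headroom = headroom_fraction * link_rate_gbps
    # Never return a negative cap: if the LC job needs the whole link,
    # BE tasks get no bandwidth at all.
    return max(0.0, link_rate_gbps - lc_egress_gbps - headroom)


print(be_egress_limit(link_rate_gbps=10.0, lc_egress_gbps=3.0))
```

The resulting value would then be applied as the rate of whatever traffic-control class holds the BE tasks' traffic.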

Isolation Mechanisms (Hard)

LLC isolation

◦ Cache Allocation Technology (CAT) in recent Intel chips.

Uses way-partitioning to define non-overlapping partitions on the LLC.

Takes effect in a few milliseconds.
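Way-partitioning can be pictured as handing each class of workload a bitmask over the cache ways, where the two masks share no bits. The sketch below builds such a pair of contiguous masks. The 20-way cache and the function name are assumptions for illustration, and the code only computes the masks; it does not program any hardware.

```python
def way_masks(total_ways, be_ways):
    """Return (lc_mask, be_mask): contiguous, non-overlapping way bitmasks.

    Assumption for this sketch: BE tasks take the lowest-numbered ways
    and the LC job takes all remaining ways.
    """
    if not 0 < be_ways < total_ways:
        raise ValueError("both partitions need at least one way")
    be_mask = (1 << be_ways) - 1
    all_ways = (1 << total_ways) - 1
    lc_mask = all_ways & ~be_mask
    return lc_mask, be_mask


lc_mask, be_mask = way_masks(total_ways=20, be_ways=4)
print(f"LC mask: {lc_mask:#x}, BE mask: {be_mask:#x}")
```

Growing or shrinking the BE partition means recomputing both masks with a different `be_ways`, which keeps the partitions disjoint by construction.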

◦ Implements a software monitor to track the DRAM bandwidth usage of LC and BE jobs.

Scales down the number of cores for BE jobs if the LC job does not receive sufficient bandwidth.
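The bandwidth-driven scale-down can be sketched as a simple rule: when measured memory bandwidth exceeds a limit, estimate how much bandwidth each BE core contributes and remove enough BE cores to cover the overage. The function name, the units, and the per-core averaging are assumptions made for this example; the slides state only that BE cores are reduced when the LC job is short of bandwidth.

```python
import math


def be_cores_after_bw_check(be_cores, total_bw, bw_limit, be_bw):
    """Return the number of BE cores to keep after a bandwidth check.

    be_cores: cores currently assigned to BE tasks
    total_bw: measured bandwidth of LC and BE jobs combined (e.g. GB/s)
    bw_limit: hypothetical bandwidth ceiling that leaves room for the LC job
    be_bw:    measured bandwidth attributable to BE tasks
    """
    if total_bw <= bw_limit or be_cores == 0:
        return be_cores  # No pressure, or nothing left to remove.
    bw_per_be_core = be_bw / be_cores
    if bw_per_be_core <= 0:
        return be_cores  # BE tasks are not the source of the traffic.
    overage = total_bw - bw_limit
    cores_to_remove = math.ceil(overage / bw_per_be_core)
    return max(0, be_cores - cores_to_remove)


print(be_cores_after_bw_check(be_cores=8, total_bw=60.0, bw_limit=50.0, be_bw=32.0))
```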

Isolation Mechanisms (Hard) (Cont.)

Power isolation

◦ CPU frequency monitoring, Running Average Power Limit (RAPL), and per-core DVFS.

◦ Takes effect within a few milliseconds.

Design Approach

An optimization problem

◦ Maximize utilization subject to the constraint that the SLO must be met.

Heracles

◦ Decomposes the high-dimensional optimization problem into many smaller, independent problems by decoupling the interference sources.

◦ Monitors latency, latency slack, and load, and adjusts the BE job allocation accordingly.
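The monitoring loop described above can be sketched as a small decision function: compute the latency slack relative to the SLO, then decide whether BE tasks should be disabled, held at their current allocation, or allowed to grow. The threshold values and action names below are illustrative assumptions, not figures taken from the slides.

```python
def latency_slack(slo_latency, measured_latency):
    """Fraction of the SLO target still unused; negative means a violation."""
    return (slo_latency - measured_latency) / slo_latency


def decide(slo_latency, measured_latency, load,
           max_load=0.85, grow_slack=0.10):
    """Pick an action for the BE allocation.

    max_load and grow_slack are hypothetical thresholds for this sketch:
    above max_load the LC job gets the whole machine, and below grow_slack
    the BE allocation is frozen rather than expanded.
    """
    slack = latency_slack(slo_latency, measured_latency)
    if slack < 0:
        return "disable_be"   # SLO already violated.
    if load > max_load:
        return "disable_be"   # LC load is too high to share the server.
    if slack < grow_slack:
        return "hold_be"      # Too close to the SLO to give BE more.
    return "grow_be"          # Comfortable slack: BE may receive more.


print(decide(slo_latency=100.0, measured_latency=50.0, load=0.5))
```

Keeping this top-level decision separate from the per-resource adjustments mirrors the decomposition into smaller, independent problems.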

System Diagram

High-level Controller

Core & Memory Sub-controller

Max Load under SLO

Power and Network Sub-controller

Evaluation

Two sets of experiments

◦ Co-locating LC applications with BE tasks on a single server.

◦ Measuring the end-to-end latency of websearch on tens of servers while BE tasks are also running.

Effective Machine Utilization (EMU)

◦ EMU = LC throughput + BE throughput
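A short worked example of the EMU metric follows. It assumes that each throughput is normalized to what the same workload achieves when it runs alone on the server, so that the two terms are comparable fractions. The function name and the sample numbers are made up for illustration.

```python
def emu(lc_throughput, lc_alone, be_throughput, be_alone):
    """Effective Machine Utilization as a sum of normalized throughputs.

    lc_alone and be_alone are the throughputs each workload reaches when
    it has the whole server to itself (an assumption of this sketch).
    """
    return lc_throughput / lc_alone + be_throughput / be_alone


# LC job at 40% of its standalone peak, BE task at 50% of its own.
value = emu(lc_throughput=400.0, lc_alone=1000.0,
            be_throughput=50.0, be_alone=100.0)
print(f"EMU = {value:.0%}")
```

Under this normalization, EMU can exceed 100% when the two workloads stress different resources and so complement each other.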

Workloads

Three Google production LC workloads:

◦ websearch

◦ ml_cluster: real-time text clustering using machine learning

◦ memkeyval: in-memory key-value store

Run the LC workloads alongside benchmarks that each stress a single shared resource.

◦ Stream-LLC, Stream-DRAM, cpu-pwr, iperf, brain, and streetview.

Latency of LC Applications

Shared Resource Utilization

Websearch in Cluster

Conclusion

Heracles

◦ A heuristic, feedback-based system that manages four isolation mechanisms to enable a latency-critical workload to be co-located with batch jobs without SLO violations.

◦ Evaluation on real hardware demonstrates an average utilization of 90% across all evaluated scenarios without any SLO violations for the latency-critical job.
