Heracles: Improving Resource Efficiency at Scale
ISCA’15
Stanford University, Google, Inc.
Outline
Introduction
Design
◦ Isolation Mechanisms
◦ Controllers
Evaluation
Conclusion
Motivation
Average server utilization in most datacenters is low, ranging between 10% and 50%.
◦ It is difficult to consolidate latency-critical services on a subset of highly utilized servers.
Increase server utilization by launching best-effort tasks on the same server as a latency-critical job.
Motivation (Cont.)
Previous work tends to protect LC workloads, but reduces the opportunities for higher utilization through co-location.
Goal
Eliminate SLO violations at all levels of load for the LC job while maximizing the throughput of BE tasks.
Heracles
A real-time, feedback-based controller
◦ Enables the safe co-location of best-effort (BE) tasks alongside a latency-critical (LC) service.
◦ Ensures that LC jobs meet their latency target while maximizing the resources given to BE tasks.
Heracles (Cont.)
◦ Four hardware and software isolation mechanisms.
Hardware: shared cache partitioning, fine-grained power/frequency settings.
Software: core isolation, network traffic control.
Isolation Mechanisms (Soft)
Core isolation
◦ Pin each workload to a set of cores using cpuset cgroups.
◦ Speed of (re)allocation: tens of milliseconds.
Network traffic
◦ Limit the outgoing bandwidth of BE tasks using Linux traffic control.
◦ No limit on the LC job.
◦ Takes effect in less than hundreds of milliseconds.
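The core-isolation step can be sketched in Python as follows. This is a minimal illustration, not the Heracles implementation: the group names `lc` and `be`, the policy of handing the highest-numbered cores to BE tasks, and the helper names are all assumptions. The demo writes into a temporary directory; on a real host the root would be the mounted cpuset cgroup hierarchy (for example `/sys/fs/cgroup/cpuset` under cgroup v1, also an assumption about the setup).

```python
import os
import tempfile


def format_cpu_list(cores):
    """Render core ids in the comma-separated form that cpuset.cpus accepts."""
    return ",".join(str(c) for c in sorted(cores))


def split_cores(total_cores, be_cores):
    """Give the highest-numbered `be_cores` cores to BE tasks, the rest to LC."""
    if not 0 <= be_cores < total_cores:
        raise ValueError("the LC job must keep at least one core")
    boundary = total_cores - be_cores
    return list(range(boundary)), list(range(boundary, total_cores))


def apply_cpusets(cgroup_root, lc_cores, be_cores):
    """Write each core list to <cgroup_root>/<group>/cpuset.cpus."""
    for group, cores in (("lc", lc_cores), ("be", be_cores)):
        group_dir = os.path.join(cgroup_root, group)
        os.makedirs(group_dir, exist_ok=True)
        with open(os.path.join(group_dir, "cpuset.cpus"), "w") as f:
            f.write(format_cpu_list(cores))


# Demo against a temporary directory instead of the real cgroup filesystem.
demo_root = tempfile.mkdtemp()
lc, be = split_cores(total_cores=8, be_cores=2)
apply_cpusets(demo_root, lc, be)
```

Keeping the two core sets disjoint is what makes the reallocation cheap: growing or shrinking the BE share only requires rewriting the two core lists.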
Isolation Mechanisms (Hard)
LLC isolation
◦ Cache Allocation Technology (CAT) in recent Intel chips.
Uses way-partitioning to define non-overlapping partitions on the LLC.
Takes effect in a few milliseconds.
◦ Implement a software monitor to track the memory (DRAM) bandwidth usage of LC and BE jobs.
Scale down the number of cores for BE jobs if the LC job does not receive sufficient bandwidth.
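Way-partitioning can be illustrated with the capacity bitmasks that CAT uses, where each bit stands for one LLC way and the set bits of a mask must be contiguous. The sketch below is only an illustration: the 20-way cache, the choice to give BE tasks the lowest ways, and the function name `way_masks` are assumptions, not details from the talk.

```python
def way_masks(total_ways, be_ways):
    """Return (lc_mask, be_mask): contiguous, non-overlapping way bitmasks."""
    if not 1 <= be_ways < total_ways:
        raise ValueError("both partitions need at least one way")
    be_mask = (1 << be_ways) - 1                  # lowest `be_ways` bits
    lc_mask = ((1 << total_ways) - 1) & ~be_mask  # all remaining ways
    return lc_mask, be_mask


# Hypothetical 20-way LLC with 4 ways handed to BE tasks.
lc_mask, be_mask = way_masks(total_ways=20, be_ways=4)
print(f"LC ways: {lc_mask:#07x}, BE ways: {be_mask:#07x}")
```

Because the two masks share no bits, BE tasks cannot evict cache lines that belong to the LC partition.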
Isolation Mechanisms (Hard) (Cont.)
Power isolation
◦ CPU frequency monitoring, Running Average Power Limit (RAPL), and per-core DVFS.
◦ Takes effect within a few milliseconds.
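One way these three pieces could fit together is a periodic check that lowers the BE cores' frequency when the package is close to its power limit and the LC cores are running below their guaranteed frequency. The sketch below only illustrates that idea: the 90% threshold, the parameter names, and the returned action strings are assumptions, and the power and frequency readings are passed in as plain numbers rather than read from RAPL or the frequency counters.

```python
def power_step(package_power_w, tdp_w, lc_freq_ghz, guaranteed_freq_ghz,
               power_threshold=0.90):
    """Decide how to change the BE cores' frequency for one control interval."""
    near_limit = package_power_w > power_threshold * tdp_w
    lc_throttled = lc_freq_ghz < guaranteed_freq_ghz
    if near_limit and lc_throttled:
        return "lower_be_frequency"   # BE tasks are stealing power headroom
    if not near_limit and not lc_throttled:
        return "raise_be_frequency"   # headroom is available for BE tasks
    return "hold"


# Hypothetical readings: 140 W on a 145 W part, LC cores below 2.3 GHz.
action = power_step(140.0, 145.0, lc_freq_ghz=2.1, guaranteed_freq_ghz=2.3)
```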
Design Approach
An optimization problem
◦ Maximize utilization under the constraint that the SLO must be met.
Heracles
◦ Decomposes the high-dimensional optimization problem into many smaller, independent problems by decoupling interference sources.
◦ Monitors latency, latency slack, and load, and adjusts the BE job allocation accordingly.
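The monitoring loop can be sketched as a small decision function. Latency slack is taken here as the fraction of the SLO target that is still unused, (target - measured) / target. Everything else is a simplified illustration rather than the controller presented in the talk: the threshold values, the four action names, and the function names are assumptions.

```python
def latency_slack(slo_target_ms, measured_latency_ms):
    """Fraction of the SLO target still unused; negative means a violation."""
    return (slo_target_ms - measured_latency_ms) / slo_target_ms


def decide(slack, load, load_limit=0.85, hold_slack=0.10, shrink_slack=0.05):
    """Map the monitored slack and load (fraction of peak) to a BE action."""
    if slack < 0 or load > load_limit:
        return "disable_be"   # SLO violated or LC load too high to share
    if slack < shrink_slack:
        return "shrink_be"    # very little headroom: take resources back
    if slack < hold_slack:
        return "hold_be"      # some headroom: stop growing BE tasks
    return "grow_be"          # ample headroom: give BE tasks more resources


# Hypothetical sample: 100 ms target, 70 ms measured, LC at 50% of peak load.
action = decide(latency_slack(100.0, 70.0), load=0.50)
```

Splitting the decision from the resource changes mirrors the decomposition above: this function only says whether BE tasks may grow, while separate sub-controllers decide which resource to adjust.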
System Diagram
High-level Controller
Core & Memory Sub-controller
Max Load under SLO
Power and Network Sub-controller
Evaluation
Two sets of experiments
◦ Co-locate LC applications with BE tasks on a single server.
◦ Measure the end-to-end latency of websearch on tens of servers while BE tasks are also running.
Effective Machine Utilization (EMU)
◦ LC throughput + BE throughput
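A short worked example of the EMU sum follows. It assumes each throughput is first normalized to what the same job achieves when running alone on the machine, so the result is a fraction and can exceed 1.0; the sample numbers and function names are hypothetical.

```python
def normalized(throughput, peak_alone):
    """Throughput as a fraction of the job's peak when running alone."""
    return throughput / peak_alone


def effective_machine_utilization(lc_norm, be_norm):
    """EMU = LC throughput + BE throughput (both normalized)."""
    return lc_norm + be_norm


# Hypothetical sample: LC job at 40% of its peak, BE job at 50% of its peak.
lc_norm = normalized(4_000.0, 10_000.0)   # e.g. queries per second
be_norm = normalized(250.0, 500.0)        # e.g. work items per second
emu = effective_machine_utilization(lc_norm, be_norm)
print(f"EMU = {emu:.0%}")
```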
Workloads
Three Google production LC workloads:
◦ websearch
◦ ml_cluster: real-time text clustering using machine learning
◦ memkeyval: in-memory key-value store
Run LC workloads with benchmarks that stress a single shared resource.
◦ Stream-LLC, Stream-DRAM, cpu-pwr, iperf, brain, and streetview.
Latency of LC Applications
EMU
Shared Resource Utilization
Websearch in Cluster
Conclusion
Heracles
◦ A heuristic, feedback-based system that manages four isolation mechanisms so that a latency-critical workload can be co-located with batch jobs without SLO violations.
◦ Evaluation on real hardware demonstrates an average utilization of 90% across all evaluated scenarios without any SLO violations for the latency-critical job.