ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Victor Marmol [email protected]

Upload: victor-marmol

Post on 29-Jan-2018

TRANSCRIPT

Page 1: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Confidential + Proprietary

Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Victor Marmol [email protected]

Page 2: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Today

App

? ? ?

Page 3: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Containers Infrastructure

Manage containers @ Google

Everything runs in a container

2B+ containers started per week

Images by Connie Zhou

Page 4: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

You May Know Some of Our OSS Work

Let Me Contain That For You

Page 5: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

What about at Google?

Images by Connie Zhou

Page 6: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Borg

Page 7: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

What is Borg?

Large-scale cluster management at Google with Borg

Page 8: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Borglet

Google’s node agent

Borglet = init + Docker + a few other things

Primary goals

➔ Talk to master
➔ Manage tasks
➔ Manage resources (containers)

Page 9: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

How do we get to task performance management?

Dremel: Interactive Analysis of Web-Scale Datasets

Page 10: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Task Performance Analysis (TPA)

Our system for container-based black-box application performance analysis

Containers are the main enabler

Manage, monitor, and improve application performance

Today’s Talk

➔ How does it work
➔ User stories: stories from the front lines!

Page 11: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

How does it work?

Page 12: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Overall Flow

Collection → Aggregation → Baselines → SLOs → Enforcement

Page 13: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Low-Level Performance Metrics

Key: collect lots of container-based low-level metrics from the kernel

Custom kernel patches to give us even more stats and metrics

Sources
➔ cgroups
➔ /proc
➔ perf_events
➔ misc (e.g., netlink, ioctls)
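As a rough illustration of the cgroups source, the sketch below parses a `cpu.stat` file in the "flat keyed" layout that upstream Linux cgroup v2 uses (one `key value` pair per line). The helper names, the sample numbers, and the directory argument are hypothetical; the custom kernel patches mentioned above are not public and are not represented here.

```python
from pathlib import Path


def parse_flat_keyed(text: str) -> dict[str, int]:
    """Parse a cgroup 'flat keyed' file: one `key value` pair per line."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


def read_cpu_stat(cgroup_dir: str) -> dict[str, int]:
    """Read cpu.stat from one container's cgroup directory (path is an assumption)."""
    return parse_flat_keyed((Path(cgroup_dir) / "cpu.stat").read_text())


# Made-up sample in the cgroup v2 cpu.stat layout, so the sketch runs anywhere.
SAMPLE = """usage_usec 1200000
user_usec 900000
system_usec 300000
nr_periods 50
nr_throttled 5
throttled_usec 40000
"""

stats = parse_flat_keyed(SAMPLE)
# Fraction of CPU bandwidth periods in which the container was throttled.
throttled_fraction = stats["nr_throttled"] / stats["nr_periods"]
```

On a cgroup v2 host, `read_cpu_stat` would be pointed at a container's directory under the cgroup mount; the exact path depends on the container runtime.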

[Figure: Container and App, emitting low-level performance metrics and telemetry]

Page 14: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Low-Level Performance Metrics

Histograms are our favorite: number, breakdown, and tail of operations
➔ CPU latencies
➔ Memory reclaim, page faults, re-faults
➔ I/O wait time and service time
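To make "number, breakdown, and tail" concrete, here is a minimal sketch of a latency histogram with power-of-two buckets. The bucket scheme, function names, and sample latencies are assumptions for illustration; the talk does not say how the real histograms are bucketed.

```python
from collections import Counter


def bucket_upper_bound_us(latency_us: int) -> int:
    """Return the power-of-two upper bound (microseconds) of the bucket holding latency_us."""
    bound = 1
    while bound < latency_us:
        bound *= 2
    return bound


def build_histogram(latencies_us) -> Counter:
    """Count samples per bucket: the 'number' and 'breakdown' of operations."""
    return Counter(bucket_upper_bound_us(v) for v in latencies_us)


def tail_bucket(hist: Counter, quantile: float) -> int:
    """Smallest bucket bound that covers at least `quantile` of all samples."""
    total = sum(hist.values())
    seen = 0
    for bound in sorted(hist):
        seen += hist[bound]
        if seen >= quantile * total:
            return bound
    raise ValueError("empty histogram")


# Made-up sample: mostly fast operations, a few slow ones, one very slow one.
latencies = [3] * 90 + [40] * 9 + [5000]
hist = build_histogram(latencies)
p50, p99, p999 = (tail_bucket(hist, q) for q in (0.5, 0.99, 0.999))
```

The median bucket hides the single 5000 µs operation entirely; only the tail quantiles surface it, which is why a histogram is more useful here than an average.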

Metrics collected every 1s - 10s
➔ 1s: Used for on-machine control loops
➔ 10s: Exported for off-machine analysis
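One way to picture the two cadences: the on-machine loop sees every 1-second sample, while the exporter collapses each run of ten samples into one record. The summary fields (mean and max) and the function name below are assumptions, not the exported format described in the talk.

```python
from statistics import mean


def export_windows(samples_1s, window=10):
    """Collapse consecutive 1-second samples into one summary per full window."""
    summaries = []
    for start in range(0, len(samples_1s) - window + 1, window):
        chunk = samples_1s[start:start + window]
        summaries.append(
            {"start_s": start, "mean": mean(chunk), "max": max(chunk)}
        )
    return summaries


# Twenty made-up 1-second readings become two 10-second export records.
summaries = export_windows(list(range(20)))
```

Keeping the maximum alongside the mean is a design choice in this sketch: it preserves short spikes that a 10-second average would smooth away.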

Collection is very low-overhead

Page 15: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Cluster-Wide Aggregation

Cluster service that collects all metrics and exports them to Dremel

Push data for all tasks on all machines, keep them for a while

Single-handedly our most valuable resource
➔ SQL is very expressive and flexible
➔ Ability to query all that data in seconds: priceless
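The kind of SQL slicing described here can be sketched with Python's built-in `sqlite3` module standing in for Dremel or BigQuery. The table name, columns, and rows are invented for illustration, and SQLite's dialect differs from both systems in places.

```python
import sqlite3

# In-memory database standing in for the cluster-wide performance store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_metrics "
    "(cell TEXT, job TEXT, task INTEGER, wakeup_latency_ms REAL)"
)
conn.executemany(
    "INSERT INTO task_metrics VALUES (?, ?, ?, ?)",
    [
        ("cell-a", "search", 0, 1.0),
        ("cell-a", "search", 1, 3.0),
        ("cell-b", "search", 0, 2.0),
        ("cell-b", "search", 1, 10.0),
    ],
)

# Slice one job by cell and rank cells by their worst task.
rows = conn.execute(
    """
    SELECT cell,
           COUNT(*) AS tasks,
           AVG(wakeup_latency_ms) AS avg_ms,
           MAX(wakeup_latency_ms) AS worst_ms
    FROM task_metrics
    WHERE job = 'search'
    GROUP BY cell
    ORDER BY worst_ms DESC
    """
).fetchall()
```

Changing the `GROUP BY` column is all it takes to pivot the same data by job, machine, or time window, which is the flexibility the slide is pointing at.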

Best news: You can use it too! Google BigQuery

[Figure: Performance Data DB, BigQuery]

Page 16: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Baselines

Cluster-level service: slice & dice data
➔ Types of tasks
➔ Distributions across replicas
➔ Per compute cluster (Borg cell)
➔ Historical trends

Gives us insights into performance trends and helps us develop performance baselines

Performance baseline: performance we can achieve given different parameters
➔ CPU: How quickly can we schedule you on the CPU
➔ Disk I/O: What disk I/O latency can we achieve
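A baseline of this kind can be sketched as a percentile computed per slice of the data. The grouping key (task type and cell), the nearest-rank percentile, and the 95th-percentile default are all assumptions chosen for illustration.

```python
from collections import defaultdict


def nearest_rank(values, pct: int):
    """Nearest-rank percentile using integer arithmetic (pct is 1..100)."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def baselines(samples, pct: int = 95):
    """samples: iterable of (task_type, cell, latency_ms) observations."""
    groups = defaultdict(list)
    for task_type, cell, latency_ms in samples:
        groups[(task_type, cell)].append(latency_ms)
    return {key: nearest_rank(vals, pct) for key, vals in groups.items()}


# Made-up observations for two task types in one cell.
samples = [("serving", "cell-a", float(v)) for v in range(1, 21)]
samples += [("batch", "cell-a", float(v)) for v in (10, 20, 30, 40)]
result = baselines(samples)
```

Each resulting value answers "what latency do 95% of comparable tasks achieve?", which is the raw material for deciding what can be promised.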

Page 17: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Baselines → SLOs

From baselines we provide performance SLOs: a promise to the user

You promise to do X

➔ CPU: Use at most as much CPU as you asked for
➔ Disk I/O: Issue less than X I/Os per second

We promise to give you Y performance

➔ CPU: You will get scheduled on a CPU within Y ms of requesting it
➔ Disk I/O: You will get I/O wait time of at most Y ms
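The two-sided promise can be written down as a small check: the infrastructure's guarantee (Y) only applies while the task keeps its own side of the bargain (X). The class, field names, and return strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CpuSlo:
    cpu_limit_cores: float        # "X": the CPU the task asked for
    max_wakeup_latency_ms: float  # "Y": the scheduling latency promised in return


def evaluate(slo: CpuSlo, cpu_used_cores: float, wakeup_latency_ms: float) -> str:
    """Classify one observation against the two-sided CPU SLO."""
    if cpu_used_cores > slo.cpu_limit_cores:
        # The task broke its promise, so the latency promise does not apply.
        return "user_over_limit"
    if wakeup_latency_ms > slo.max_wakeup_latency_ms:
        # The task behaved, but the infrastructure did not deliver.
        return "slo_violated"
    return "ok"


# Made-up numbers: a 2-core task promised a 5 ms wakeup latency.
slo = CpuSlo(cpu_limit_cores=2.0, max_wakeup_latency_ms=5.0)
```

Checking the user's side first matters: a task that exceeds its request is not counted as an infrastructure failure, however slow it becomes.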

Page 18: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Enacting SLOs

Monitor SLOs closely and aggressively ensure they are met

Per-node
➔ Give more resources or better quality resources
➔ Throttle bad actors (antagonists)

Cluster-wide
➔ Ask for help!
➔ Move task to a different machine
➔ Move antagonist to a different machine
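The per-node and cluster-wide options can be read as an escalation ladder. The sketch below encodes one possible ordering (local fixes first, then asking the cluster for help); the ordering, the retry limit, and the action names are assumptions, since the talk lists the options without ranking them.

```python
from typing import Optional


def choose_action(
    spare_capacity: bool,
    antagonist: Optional[str],
    throttles_tried: int,
    max_throttles: int = 3,
) -> str:
    """Pick a remedy for a task that is missing its SLO on this machine."""
    if spare_capacity:
        # Per-node: cheapest fix, give the victim more or better resources.
        return "grant_more_resources"
    if antagonist is not None and throttles_tried < max_throttles:
        # Per-node: slow down the task that is causing the interference.
        return f"throttle:{antagonist}"
    if antagonist is not None:
        # Cluster-wide: throttling did not help, ask for the antagonist to move.
        return f"move_antagonist:{antagonist}"
    # Cluster-wide: no identifiable culprit, ask for the victim to move.
    return "move_task"
```

For example, `choose_action(False, "batch-42", 0)` returns `"throttle:batch-42"`, and the same call after three failed throttles escalates to `"move_antagonist:batch-42"`.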

Page 19: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Metrics
➔ CPU
➔ NUMA
➔ Disk I/O

Page 20: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

CPU

Low-level metrics
➔ Wakeup latency: time between wanting to run and running
➔ Round-robin latency: how well you share CPU within your app
➔ Load: how much work you wanted to do
➔ Time per state: how much time you spent in each state (e.g., sleep, wait, run, queue)
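Two of these metrics can be derived from a per-thread trace of state transitions. The sketch below assumes a hypothetical trace format of `(timestamp_ms, state)` pairs and treats the "queue" state as "wanting to run"; real kernel accounting is more involved.

```python
from collections import defaultdict


def time_per_state(events, end_ms):
    """events: ordered (timestamp_ms, state) transitions for one thread."""
    totals = defaultdict(int)
    following = events[1:] + [(end_ms, None)]
    for (start, state), (next_start, _) in zip(events, following):
        totals[state] += next_start - start
    return dict(totals)


def wakeup_latencies(events):
    """Time spent runnable ('queue') before each transition to 'run'."""
    return [
        next_ts - ts
        for (ts, state), (next_ts, next_state) in zip(events, events[1:])
        if state == "queue" and next_state == "run"
    ]


# Made-up trace: the thread sleeps, becomes runnable, runs, and repeats.
trace = [(0, "sleep"), (10, "queue"), (12, "run"),
         (20, "sleep"), (30, "queue"), (37, "run")]
states = time_per_state(trace, end_ms=40)
wakeups = wakeup_latencies(trace)
```

In this trace the second wakeup took 7 ms against 2 ms for the first, the kind of spread that a wakeup-latency histogram would expose.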

Page 21: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

CPU

SLOs
➔ Wakeup latency when well-behaved
➔ CPU usage rate when well-behaved

Page 22: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

NUMA

Low-level metrics
➔ CPU locality: how much of your CPU (and usage) was in local vs remote nodes
➔ Memory locality: how much of your memory (and accesses) was in local vs remote nodes
➔ NUMA score: resource-product of both above (0.0 - 1.0)

SLOs
➔ NUMA score of 0.85 or above given certain job shapes
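The slides do not give the exact NUMA score formula. As a loudly labeled assumption, the sketch below treats it as the plain product of the two locality fractions, which stays within 0.0 - 1.0 and drops quickly when either kind of locality is poor. Only the 0.85 threshold comes from the talk.

```python
NUMA_SLO = 0.85  # threshold from the slide


def numa_score(cpu_locality: float, memory_locality: float) -> float:
    """ASSUMED formula: product of the two locality fractions, each in [0.0, 1.0]."""
    for value in (cpu_locality, memory_locality):
        if not 0.0 <= value <= 1.0:
            raise ValueError("locality must be within [0.0, 1.0]")
    return cpu_locality * memory_locality


def meets_numa_slo(cpu_locality: float, memory_locality: float,
                   threshold: float = NUMA_SLO) -> bool:
    """True when the (assumed) score reaches the SLO threshold."""
    return numa_score(cpu_locality, memory_locality) >= threshold
```

Under this assumed formula, a task with 90% CPU locality and 90% memory locality scores about 0.81 and misses the SLO, even though each figure looks healthy on its own.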

The NUMA Experience

Page 23: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Disk I/O

Low-level metrics
➔ Service time latency: time it took kernel to service request to disk
➔ Wait time latency: time it took kernel to queue and service request to disk
➔ Queued: how much work you wanted to do
➔ Usage: how much work you actually did

SLOs
➔ Small amount of disk time when well-behaved
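The difference between service time and wait time is easiest to see on a single request. The sketch assumes three hypothetical timestamps per request (submitted, dispatched to disk, completed); the class and field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class IoRequest:
    submitted_ms: float   # request enters the kernel queue
    dispatched_ms: float  # request is issued to the disk
    completed_ms: float   # disk reports completion

    @property
    def service_time_ms(self) -> float:
        """Time the disk spent servicing the request."""
        return self.completed_ms - self.dispatched_ms

    @property
    def queue_time_ms(self) -> float:
        """Time the request sat queued before reaching the disk."""
        return self.dispatched_ms - self.submitted_ms

    @property
    def wait_time_ms(self) -> float:
        """Queue time plus service time: what the application experiences."""
        return self.completed_ms - self.submitted_ms


# Made-up request: the disk was fast, but the request queued for a long time.
req = IoRequest(submitted_ms=0.0, dispatched_ms=8.0, completed_ms=10.0)
```

Here the service time is 2 ms but the wait time is 10 ms, so the slowdown comes from queuing behind other work rather than from the disk itself.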

Page 24: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

User Stories

Page 25: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Regression

User: VM environment

User Problem: … silence ...

SLO not met: CPU

Signal: CPU queue other

Root cause: Subtle, but expensive, new periodic operation

Make it better: Give the application more debug information

Page 26: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Variation #1

User: Flight search

User Problem: QPS variation on some tasks

SLO not met: NUMA

Signal: CPU and memory locality

Root cause: Bad NUMA allocation by infrastructure

Make it better: Improve NUMA allocation

Page 27: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Variation #2

User: Web search

User Problem: Latency variation on some tasks

SLO not met: CPI variation

Signal: CPI from perf_events

Root cause: Bad actors co-scheduled on the machine

Make it better: Throttle or move these bad actors
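CPI is cycles per instruction, and both counters are available through perf_events. The sketch below computes CPI per task and flags tasks that sit well above the median of their peers. The median-based rule, the 1.5x factor, and the counter values are assumptions for illustration, not the detection method used in the talk.

```python
from statistics import median


def cpi(cycles: int, instructions: int) -> float:
    """Cycles per instruction over one sampling window."""
    return cycles / instructions


def cpi_outliers(task_cpi: dict, factor: float = 1.5) -> list:
    """Tasks whose CPI exceeds `factor` times the median CPI of their peers."""
    typical = median(task_cpi.values())
    return sorted(t for t, value in task_cpi.items() if value > factor * typical)


# Made-up (cycles, instructions) counters for four replicas of one job.
counters = {
    "task-0": (2_000, 1_000),
    "task-1": (2_100, 1_000),
    "task-2": (1_900, 1_000),
    "task-3": (4_000, 1_000),
}
task_cpi = {task: cpi(c, i) for task, (c, i) in counters.items()}
suspects = cpi_outliers(task_cpi)
```

Comparing replicas of the same job against each other works because they run the same code: a task that needs twice the cycles per instruction is more likely suffering interference than doing different work.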

Page 28: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Performance Degradation Under Load

User: Borglet

User Problem: Stuckness under heavy load

SLO not met: Disk access

Signal: Disk I/O wait time latencies

Root cause: Heavy disk operations blocking other operations

Make it better: Move disk operations away from latency sensitive operations

Page 29: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Future Work

➔ Signals for more resources (e.g., memory)
➔ Using the right signals
➔ Better reporting and fleet-wide view to catch regressions across various components

Helping apps more
➔ Where are the problems?
➔ Suggest how to fix problems we can’t fix ourselves

Page 30: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Takeaways

➔ Containers are the main enabler: common language for performance signals
➔ More data ⇒ better decisions
➔ Slicing and dicing of data is priceless for finding patterns and baselines
➔ On-by-default performance monitoring: low overhead and high value
➔ Performance SLOs give power to the application and make infrastructure cheaper

Page 31: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

You can do this too!

Page 32: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Questions?

Victor Marmol [email protected]

Page 33: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Join our Microservices Customer Roundtable

● Friday 8am - 1pm @ Google's Toronto office
● Hear real life experiences of two companies using GKE
● Share war stories with your peers
● Learn about future plans for microservice management from Google
● Help shape our roadmap

g.co/microservicesroundtable
† Must be able to sign digital NDA

Page 34: ContainerCon 2016: Finding (and Fixing!) Performance Anomalies in Large Scale Distributed Systems

Questions?

Images by Connie Zhou