(bdt207) real-time analytics in service of self-healing ecosystems

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Chris Sanden, Netflix Roy Rapoport, Netflix October 2015 BDT207 Real-Time Analytics In Service of Self-Healing Ecosystems

Upload: amazon-web-services

Post on 16-Apr-2017


@chris_sanden

Chris & Roy

@royrapoport

Prerequisites

Expectations

[Diagram: (Reasonable) Telemetry System, Real-Time Analytics System(s), and Orchestration Systems, connected by Data, Decision, and Observation]

Bad News: An Evolution

Not Bad

• Absolutely necessary

• Pretty useful

• Insufficient

Scale At Scale

We’ve Got

1,982,562,395

Problems

And Boredom Ain’t One

Scale At Scale

Complexity in a

Few* Dimensions

* For sufficiently large values of “few”

Scale At Scale

421,010

Scale At Scale

Telemetry Volume is silly

Scale At Scale

2,000,000,000 is silly

Scale At Scale

14”

Scale At Scale

MMO: Most Memorable Outage

• One device (out of ~10³)

• One test cell (out of ~10¹)

• One test (out of ~10⁴)

• Couldn’t view House of Cards S3E1

• For a week

Scale At Scale

We have weird, device-specific problems all the time, and interactions with A/B tests only make them more complicated, so I'm not sure we have a pat moral of the story except that we really like alerting and fast responses.

- Matt McCarthy

Bad News About the Cloud

• Infrastructure no longer the bottleneck

• Before: Weeks to change infrastructure

• After: API call

• TTD expectations vastly higher

• AWS makes us the lameness bottleneck

Good News About the Cloud

• Infrastructure no longer the bottleneck

• Before: Weeks to change infrastructure

• After: API call

• Rapid recovery, automated response possible

• AWS: Enabling productive laziness for 9 years and counting

Don’t Forget to Bring a Towel!

Monitoring Capabilities You’ll Find Useful

• Time series

• Event Streaming

• Dependency Discovery and Inspection

Real-Time Analytics

1. Prediction

2. Detection

3. Correlation

1. Prediction

1.1 Predictive Scaling

Predictive Scaling

Auto Scaling is reactive.

• SCALE UP by 10%

• WHEN Requests Per Second > 120

• FOR 10 consecutive minutes

• FOLLOWED-BY a cool-down of 15 minutes
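
The rule on this slide can be read as a small state machine. The sketch below is one possible reading, not how Auto Scaling is implemented: the names `ReactiveRule` and `simulate` are hypothetical, time is assumed to advance in one-minute steps, and breaches seen during the cool-down are assumed not to count toward the next trigger.

```python
from dataclasses import dataclass


@dataclass
class ReactiveRule:
    threshold_rps: float = 120.0   # WHEN Requests Per Second > 120
    sustain_minutes: int = 10      # FOR 10 consecutive minutes
    scale_up_percent: int = 10     # SCALE UP by 10%
    cooldown_minutes: int = 15     # FOLLOWED-BY a cool-down of 15 minutes


def simulate(rule, rps_per_minute, start_capacity):
    """Return the instance count after each minute of the RPS series."""
    capacity = start_capacity
    breach_streak = 0    # consecutive minutes above the threshold
    cooldown_left = 0    # minutes remaining in the cool-down
    history = []
    for rps in rps_per_minute:
        if cooldown_left > 0:
            # Assumption: nothing is evaluated while cooling down.
            cooldown_left -= 1
            breach_streak = 0
        else:
            breach_streak = breach_streak + 1 if rps > rule.threshold_rps else 0
            if breach_streak >= rule.sustain_minutes:
                # Integer ceiling of capacity * percent / 100.
                capacity += -(-capacity * rule.scale_up_percent // 100)
                breach_streak = 0
                cooldown_left = rule.cooldown_minutes
        history.append(capacity)
    return history


if __name__ == "__main__":
    print(simulate(ReactiveRule(), [150] * 35, start_capacity=100))
```

Under these assumptions a sustained surge takes 10 minutes to trigger the first scale-up and another 25 minutes to trigger the second, which is the lag the "reactive" label points at.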

Predictive Scaling

Advanced Use Cases

• Rapid, recurring spikes in demand

• Variable traffic patterns

• Outages

Predictive Scaling

Concept

• Anticipate change in traffic and workload.

• Predict the resources needed ahead of time.

• Proactively scale up or down.

Predictive Scaling

Metric Selection

• Clear, relatively stable, and recurring pattern.

• Independent of cluster performance.

Predictive Scaling

Requests Per Second (RPS)

Predictive Scaling

Fast Fourier Transform (FFT)

Predictive Scaling

FFT-based Prediction

Prediction

Predictive Scaling

Action Plan

Predictive Scaling

Prediction Workflow: Metric → FFT → Prediction → Action Plan → Scale
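
A rough sketch of the workflow above, under several assumptions: it uses a plain discrete Fourier transform written with the standard library (an FFT computes the same coefficients faster), keeps only the strongest frequencies, and extends them past the end of the history. The function names, the `keep` parameter, and the `rps_per_instance` figure are hypothetical and are not taken from the talk.

```python
import cmath
import math


def dft(samples):
    """Naive O(n^2) discrete Fourier transform of a real series."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def fft_forecast(history, horizon, keep=3):
    """Extrapolate the mean plus the `keep` strongest frequencies."""
    n = len(history)
    coeffs = dft(history)
    # Rank the positive frequencies by amplitude; k=0 (the mean) is always kept.
    ranked = sorted(range(1, n // 2 + 1), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = ranked[:keep]
    forecast = []
    for t in range(n, n + horizon):
        value = coeffs[0].real / n
        for k in kept:
            # A frequency and its conjugate mirror add up to one real cosine;
            # the Nyquist term (even n only) has no mirror.
            weight = 1 if (n % 2 == 0 and k == n // 2) else 2
            value += weight * (coeffs[k] * cmath.exp(2j * math.pi * k * t / n)).real / n
        forecast.append(value)
    return forecast


def action_plan(forecast_rps, rps_per_instance):
    """Turn predicted RPS into an instance count per time step."""
    return [math.ceil(rps / rps_per_instance) for rps in forecast_rps]


if __name__ == "__main__":
    # Four days of hourly RPS with a clean 24-hour cycle (synthetic data).
    history = [100 + 50 * math.sin(2 * math.pi * t / 24) for t in range(96)]
    next_day = fft_forecast(history, horizon=24, keep=1)
    print(action_plan(next_day, rps_per_instance=40))
```

Dropping the weaker frequencies is what smooths out noise; it also means the forecast is only as good as the recurring pattern in the chosen metric.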

Predictive Scaling

Predictive-reactive Auto Scaling

• A hybrid approach

• Predict the workload of a cluster in advance and proactively scale.

• Use auto scaling to handle unexpected surges in workload.
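
One simple way to combine the two signals, shown only as an illustration: take whichever asks for more capacity. The `desired_capacity` function and the "take the larger" rule are assumptions made for this sketch, not something the slides specify.

```python
import math


def desired_capacity(predicted_instances, observed_rps, rps_per_instance):
    """Use the predicted plan as a floor and let observed load raise it."""
    # Reactive side: what the traffic seen right now would require.
    reactive_instances = math.ceil(observed_rps / rps_per_instance)
    # Hybrid: follow the prediction unless an unexpected surge needs more.
    return max(predicted_instances, reactive_instances)


if __name__ == "__main__":
    print(desired_capacity(predicted_instances=4, observed_rps=90, rps_per_instance=40))
    print(desired_capacity(predicted_instances=4, observed_rps=300, rps_per_instance=40))
```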

2. Detection

2.1 Anomaly Detection

a·nom·a·ly de·tec·tion [uh-nom-uh-lee] [dih-tek-shuhn]

1. identification of observations which do not conform to an expected pattern.

2. a task that keeps data scientists up at night.

Anomaly Detection

Anomaly Types

“Blips”

“Bloops”

Anomaly Detection

Static Threshold

Anomaly Detection

Prediction Algorithms

• FFT-based Prediction

• Double Exponential Smoothing (DES)

• Holt-Winters

• ARIMA

• Etc.

Anomaly Detection

Detection Workflow: Metric → Prediction → Residual → Threshold

Anomaly Detection

Double Exponential Smoothing
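
As a sketch of how double exponential smoothing can supply the prediction that residuals are measured against, the code below uses the common Holt formulation with a level and a trend term. The smoothing factors `alpha` and `beta`, the initialisation, and the function names are assumptions chosen for illustration.

```python
def des_one_step_forecasts(series, alpha=0.5, beta=0.3):
    """Return the one-step-ahead forecast made for each point in the series."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Assumed initialisation: first value as level, first difference as trend.
    level, trend = series[0], series[1] - series[0]
    forecasts = [series[0]]  # nothing earlier to predict the first point from
    for x in series[1:]:
        predicted = level + trend
        forecasts.append(predicted)
        # Update level toward the observation, then trend toward the level change.
        new_level = alpha * x + (1 - alpha) * predicted
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return forecasts


def residuals(series, forecasts):
    """Observed minus predicted, the quantity a detector then thresholds."""
    return [x - f for x, f in zip(series, forecasts)]


if __name__ == "__main__":
    metric = [0, 1, 2, 3, 4, 5, 6, 27, 8, 9]   # steady trend with one blip
    print(residuals(metric, des_one_step_forecasts(metric)))
```

The residual is large at the blip and, because the smoother briefly chases it, also non-zero for a few points afterwards.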

Anomaly Detection

Statistical Techniques

• Three-sigma (3-sigma)

• Kolmogorov-Smirnov (KS)

• Interquartile Range (IQR)

• Grubbs Test

• Least Squares

• Etc.

Anomaly Detection

Detection Workflow: Metric → Prediction → Residual → 3-sigma
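
The last step of this workflow might look like the following. It is a minimal sketch that assumes the mean and standard deviation come from a trusted baseline window at the start of the residual series; `three_sigma_flags` and `baseline_size` are hypothetical names.

```python
import statistics


def three_sigma_flags(residuals, baseline_size):
    """Flag residuals more than three standard deviations from the baseline mean."""
    baseline = residuals[:baseline_size]
    mean = statistics.mean(baseline)
    sigma = statistics.pstdev(baseline)   # population standard deviation
    return [abs(r - mean) > 3 * sigma for r in residuals]


if __name__ == "__main__":
    residual_series = [1, -1, 1, -1, 1, -1, 1, -1, 0.5, 10]
    print(three_sigma_flags(residual_series, baseline_size=8))
```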

Anomaly Detection

Detection Workflow - Ensemble Approach: Metric → Prediction → Residual → (3-sigma, KS, IQR) → Combine Votes
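
A minimal sketch of the voting step, assuming each technique compares a recent window of residuals with a reference window and that a simple majority decides. The window framing, the 1.5 x IQR fences, the large-sample KS critical value for a 5% level (1.358), and all function names are assumptions for illustration.

```python
import math
import statistics
from bisect import bisect_right


def three_sigma_vote(reference, recent):
    mean = statistics.mean(reference)
    sigma = statistics.pstdev(reference)
    return any(abs(r - mean) > 3 * sigma for r in recent)


def iqr_vote(reference, recent, k=1.5):
    q1, _, q3 = statistics.quantiles(reference, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return any(r < low or r > high for r in recent)


def ks_statistic(a, b):
    """Largest gap between the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def ks_vote(reference, recent, c_alpha=1.358):
    n, m = len(reference), len(recent)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(reference, recent) > critical


def ensemble_vote(reference, recent, min_votes=2):
    """Return (is_anomalous, individual votes)."""
    votes = {
        "3-sigma": three_sigma_vote(reference, recent),
        "iqr": iqr_vote(reference, recent),
        "ks": ks_vote(reference, recent),
    }
    return sum(votes.values()) >= min_votes, votes


if __name__ == "__main__":
    reference = [-2, -1, 0, 1, 2] * 4
    print(ensemble_vote(reference, [5, 0, 1, 0, -1]))
```

The detectors disagree in useful ways: a single extreme point trips 3-sigma and IQR but barely moves the KS statistic, while a sustained shift in the whole window is what KS is sensitive to.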

Anomaly Detection

Advanced Detection Techniques

• Robust Anomaly Detection (RAD) - Netflix

• Seasonal Hybrid ESD - Twitter

• Extendible Generic Anomaly Detection System (EGADS) - Yahoo

• Kale - Etsy

2.2 Outlier Detection

out·li·er de·tec·tion [out-lahy-er] [dih-tek-shuhn]

1. identification of unusual members from a set of generating mechanisms.

2. not to be confused with anomaly detection.

[Figure: axes Time and Population; labels Outlier Detection and Anomaly Detection]

Server Outlier Detection

Netflix runs on thousands of servers

• A small percentage of servers become unhealthy.

• Customer experience may be degraded.

• Time wasted looking for evidence.

Server Outlier Detection

Cluster Analysis

• Unsupervised machine learning.

• If a server belongs to a group it should be near lots of other points as measured by some distance function.

Assumption

• Servers running the same hardware and software should behave similarly.

Server Outlier Detection

DBSCAN - Density-Based Spatial Clustering of Applications with Noise

Server Outlier Detection

Detection Workflow: Metric → DBSCAN → Filter → Action
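
To make the Metric → DBSCAN → Filter steps concrete, here is a small textbook-style DBSCAN written with the standard library. It is a sketch, not the production system: the server names, the two metrics, `eps`, and `min_samples` are made up, and real metrics would normally be scaled to comparable ranges before measuring distance.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks points in no cluster."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, as in the usual definition.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)   # core point: keep growing the cluster
    return labels


def find_outlier_servers(metrics_by_server, eps, min_samples):
    """Filter step: servers that DBSCAN leaves outside every cluster."""
    names = list(metrics_by_server)
    labels = dbscan([metrics_by_server[n] for n in names], eps, min_samples)
    return [n for n, label in zip(names, labels) if label == NOISE]


if __name__ == "__main__":
    # Hypothetical (cpu percent, latency ms) per server.
    fleet = {
        "i-01": (50, 100), "i-02": (52, 101), "i-03": (49, 99),
        "i-04": (51, 102), "i-05": (48, 100), "i-06": (50, 98),
        "i-07": (53, 100), "i-08": (49, 101), "i-09": (51, 99),
        "i-bad": (95, 400),
    }
    print(find_outlier_servers(fleet, eps=10, min_samples=3))
```

The list returned by the filter is what an action step would consume.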

Server Outlier Detection

Actions / Remediation

• Send e-mail

• Page service owner

• Terminate instance

• Remove from service

• Detach from a load balancer

2.3 Automated Canary Analysis

Automated Canary Analysis

Canary Release Process

• A change is gradually rolled out to production.

• Checkpoints are performed along the way.

• A decision is made at each checkpoint.

Automated Canary Analysis

Advantages

• Better degree of trust and safety in deployments.

• Faster deployment cadence.

• Lower investment in simulation engineering.

Automated Canary Analysis

Canary Process

[Diagram: Load Balancer splits Traffic: 95% to Current Version (v1.0), 100 Servers; 5% to New Version (v1.1), 5 Servers; Metrics]

Automated Canary Analysis

Canary Process

[Diagram: Load Balancer sends Traffic: 100% to New Version (v1.1), 100 Servers; Current Version (v1.0), 0 Servers; Metrics]

Automated Canary Analysis

Automated Analysis

• Identify a set of metrics to compare.

• Use a statistical test to identify the difference between v1.0 and v1.1

• Mann–Whitney

• Kolmogorov-Smirnov

• Generate a score that indicates overall similarity.

• Percentage of metrics that match in performance.
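
The scoring idea in these bullets can be sketched as below, with several assumptions: the Mann–Whitney test uses the normal approximation without a tie correction, a metric "matches" when the two-sided p-value is at least 0.05, and the metric names and sample values are invented. `mann_whitney_p` and `canary_score` are hypothetical helper names.

```python
import math


def mann_whitney_p(a, b):
    """Two-sided p-value for the Mann-Whitney U test (normal approximation)."""
    combined = sorted((v, group) for group, sample in enumerate((a, b)) for v in sample)
    rank_sum_a = 0.0
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values share the mean of ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2
        rank_sum_a += avg_rank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j
    n1, n2 = len(a), len(b)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # no tie correction
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def canary_score(baseline, canary, alpha=0.05):
    """Percentage of metrics whose baseline and canary samples look alike."""
    matching = sum(
        1 for name in baseline
        if mann_whitney_p(baseline[name], canary[name]) >= alpha
    )
    return 100.0 * matching / len(baseline)


if __name__ == "__main__":
    v1_0 = {"latency_ms": list(range(1, 21)), "errors": list(range(1, 21))}
    v1_1 = {"latency_ms": list(range(1, 21)), "errors": list(range(101, 121))}
    print(canary_score(v1_0, v1_1))
```

A decision step would then compare the score with a pass mark chosen by the service owner; the slides do not state one.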

Automated Canary Analysis

Analysis Workflow: Metrics → Statistical Test → Calculate Score → Decision

Automated Canary Analysis

Augmented Canary Process

[Diagram: Load Balancer splits Traffic across Previous Version (v1.0), 88 Servers; New Version (Canary - v1.1), 6 Servers; Previous Version (Control - v1.0), 6 Servers; Metrics feed Analysis]

3. Correlation

Correlation Analysis

Automated Finger-Pointing for Fun and Profit

You Want Service-Oriented Architecture?

We’ve got Service-Oriented Architecture

Correlation Analysis

Automated Finger-Pointing for Fun and Profit

[Figure: service dependency graph with nodes A through J]

Correlation Analysis

Something Else Is Also Weird!

• CPU up: alert triggered

• HTTP 400: correlated spike

• HTTP requests: correlated drop
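
A minimal sketch of how "something else is also weird" could be found automatically: compute the Pearson correlation between the alerting metric and other metrics over the same window and keep the strong ones. The metric names, sample values, and the 0.8 cut-off are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0   # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlated_metrics(alerting, candidates, min_abs_r=0.8):
    """Return (name, r) pairs for strongly correlated metrics, strongest first."""
    scored = [(name, pearson(alerting, series)) for name, series in candidates.items()]
    strong = [(name, r) for name, r in scored if abs(r) >= min_abs_r]
    return sorted(strong, key=lambda pair: abs(pair[1]), reverse=True)


if __name__ == "__main__":
    cpu = [10, 10, 11, 10, 50, 55, 52]            # the metric that alerted
    others = {
        "http_400": [1, 1, 1, 1, 20, 22, 21],          # spikes with CPU
        "http_requests": [100, 101, 99, 100, 40, 35, 38],  # drops as CPU rises
        "disk_queue": [5, 6, 5, 6, 5, 6, 5],           # unrelated
    }
    for name, r in correlated_metrics(cpu, others):
        print(name, round(r, 3))
```

A positive coefficient corresponds to a correlated spike and a negative one to a correlated drop; neither says which metric is the cause.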

Correlation Analysis

If you care about this, you don’t care about that …

I Care About

This Metric!

I Also Care About

This Metric!

Maybe not?

Conclusion

Magic!

In Conclusion

Not Magic!

DES

IQR

FFT

DBSCAN

RAD/RPCA

In Conclusion

[Diagram: (Reasonable) Telemetry System, Real-Time Analytics System(s), and Orchestration Systems, connected by Data, Decision, and Observation]

In Conclusion

Wanna play?

Useful Links

• Prediction

• Predictive Auto Scaling: http://techblog.netflix.com/2013/12/scryer-netflixs-predictive-auto-scaling.htm

• FFT: https://en.wikipedia.org/wiki/Fast_Fourier_transform

• Detection

• Double Exponential Smoothing: https://en.wikipedia.org/wiki/Exponential_smoothing

• Interquartile Range (IQR): https://en.wikipedia.org/wiki/Interquartile_range

• Ensemble Learning: http://www.scholarpedia.org/article/Ensemble_learning

• Robust Anomaly Detection (RAD): http://techblog.netflix.com/2015/02/rad-outlier-detection-on-big-data.html

• DBSCAN: https://en.wikipedia.org/wiki/DBSCAN

• Server Outlier Detection: http://techblog.netflix.com/2015/07/tracking-down-villains-outlier.html

• Canary Release Process: http://martinfowler.com/bliki/CanaryRelease.html

• Automated Canary Analysis: http://www.infoq.com/presentations/canary-analysis-deployment-pattern

• Nonparametric tests: https://en.wikipedia.org/wiki/Nonparametric_statistics

• Correlation

• Pearson Correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient

Attributions

• http://aggronaut.com

• http://designsold.com/pictures-of-kittens/

• http://slate.com, Illustration by Phil Plait

• http://www-rohan.sdsu.edu/

• http://scikit-learn.org/stable/documentation.html

Remember to complete your evaluations!

Thank you!

@chris_sanden

@royrapoport