(bdt207) real-time analytics in service of self-healing ecosystems
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Chris Sanden, Netflix
Roy Rapoport, Netflix
October 2015
BDT207
Real-Time Analytics In Service of
Self-Healing Ecosystems
Expectations
Diagram: (Reasonable) Telemetry System; Real-Time Analytics System(s); Orchestration Systems. Flow labels: Data, Decision, Observation.
MMO: Most Memorable Outage
• One device (out of ~10³)
• One test cell (out of ~10¹)
• One test (out of ~10⁴)
• Couldn’t view House of Cards S3E1
• For a week
Scale At Scale
We have weird, device-specific problems all
the time, and interactions with A/B tests only
make them more complicated, so I'm not
sure we have a pat moral of the story except
that we really like alerting and fast
responses.
- Matt McCarthy
Bad News About the Cloud
• Infrastructure no longer the bottleneck
• Before: Weeks to change infrastructure
• After: API call
• TTD expectations vastly higher
• AWS makes us the lameness bottleneck
Good News About the Cloud
• Infrastructure no longer the bottleneck
• Before: Weeks to change infrastructure
• After: API call
• Rapid recovery, automated response possible
• AWS: Enabling productive laziness for 9 years and counting
Don’t Forget to Bring a Towel!
Monitoring Capabilities You’ll Find Useful
• Time series
• Event Streaming
• Dependency Discovery and Inspection
Predictive Scaling
Auto Scaling is reactive.
• SCALE UP by 10%
• WHEN Requests Per Second > 120
• FOR 10 consecutive minutes
• FOLLOWED-BY a cool-down of 15 minutes
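The rule above can be sketched as a small simulation to show why it is reactive: capacity only moves after the threshold has already been breached for the full window. This is a minimal sketch, not how any AWS service is implemented; `ReactiveRule`, `simulate`, and the one-sample-per-minute input are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReactiveRule:
    """Hypothetical container for the rule on the slide."""
    threshold_rps: float = 120.0  # WHEN Requests Per Second > 120
    sustain_minutes: int = 10     # FOR 10 consecutive minutes
    scale_up_percent: int = 10    # SCALE UP by 10%
    cooldown_minutes: int = 15    # FOLLOWED-BY a cool-down of 15 minutes


def simulate(rule, rps_per_minute, start_capacity):
    """Return the server count after each minute of the RPS series."""
    capacity = start_capacity
    breach_streak = 0
    cooldown_left = 0
    history = []
    for rps in rps_per_minute:
        if cooldown_left > 0:
            # During the cool-down the rule is not evaluated at all.
            cooldown_left -= 1
            breach_streak = 0
        else:
            breach_streak = breach_streak + 1 if rps > rule.threshold_rps else 0
            if breach_streak >= rule.sustain_minutes:
                # Integer ceiling of capacity * percent / 100.
                capacity += (capacity * rule.scale_up_percent + 99) // 100
                cooldown_left = rule.cooldown_minutes
                breach_streak = 0
        history.append(capacity)
    return history


if __name__ == "__main__":
    # Traffic jumps to 130 RPS and stays there for 40 minutes.
    capacities = simulate(ReactiveRule(), [130] * 40, start_capacity=100)
    print(capacities[8], capacities[9])    # 100 110
    print(capacities[33], capacities[34])  # 110 121
```

With sustained high traffic, the first scale-up lands in the tenth minute and the second only after the cool-down plus another ten-minute window.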
Predictive Scaling
Advanced Use Cases
• Rapid, recurring spikes in demand
• Variable traffic patterns
• Outages
Predictive Scaling
Concept
• Anticipate change in traffic and workload.
• Predict the resources needed ahead of time.
• Proactively scale up or down.
Predictive Scaling
Metric Selection
• Clear, relatively stable, and recurring pattern.
• Independent of cluster performance.
Predictive Scaling
Predictive-reactive Auto Scaling
• A hybrid approach
• Predict the workload of a cluster in advance and proactively scale.
• Use auto scaling to handle unexpected surges in workload.
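One way to picture the hybrid approach: size the cluster for the predicted workload, but never go below what the currently observed workload requires. The sketch below assumes requests per second as the driving metric (it does not depend on cluster performance); `servers_for`, `hybrid_capacity`, the per-server throughput, and the 20% headroom are hypothetical values chosen for illustration.

```python
import math


def servers_for(rps, rps_per_server, headroom_percent=20):
    """Servers needed to serve `rps` with some spare headroom (assumed sizing rule)."""
    return math.ceil(rps * (100 + headroom_percent) / (100 * rps_per_server))


def hybrid_capacity(predicted_rps, observed_rps, rps_per_server):
    """Combine a proactive plan with a reactive floor."""
    proactive = servers_for(predicted_rps, rps_per_server)  # scale ahead of the forecast
    reactive = servers_for(observed_rps, rps_per_server)    # catch unexpected surges
    return max(proactive, reactive)


if __name__ == "__main__":
    # Forecast holds: the prediction drives capacity.
    print(hybrid_capacity(predicted_rps=1000, observed_rps=800, rps_per_server=50))   # 24
    # Unexpected surge: the reactive side takes over.
    print(hybrid_capacity(predicted_rps=1000, observed_rps=1500, rps_per_server=50))  # 36
```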
a·nom·a·ly de·tec·tion [uh-nom-uh-lee] [dih-tek-shuhn]
1. identification of observations which do not conform to an expected pattern.
2. a task that keeps data scientists up at night.
Anomaly Detection
Prediction Algorithms
• FFT-based Prediction
• Double Exponential Smoothing (DES)
• Holt-Winters
• ARIMA
• Etc.
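As an illustration of one entry in this list, here is a minimal sketch of double exponential smoothing in its common Holt linear-trend form. The initialization (first value as level, first difference as trend) and the default `alpha` and `beta` are assumptions; real deployments would tune them per metric.

```python
def des_forecast(series, alpha=0.5, beta=0.5, horizon=1):
    """Forecast `horizon` steps ahead with double exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]
    for value in series[1:]:
        previous_level = level
        # Smooth the level toward the new observation.
        level = alpha * value + (1 - alpha) * (level + trend)
        # Smooth the trend toward the latest change in level.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + horizon * trend


if __name__ == "__main__":
    history = [10, 12, 14, 16, 18]
    predicted = des_forecast(history)
    print(predicted)       # 20.0
    actual = 27
    print(actual - predicted)  # residual of 7.0, the input to a detection step
```

The residual (actual minus predicted) is what a detection technique would then examine.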
Anomaly Detection
Statistical Techniques
• Three-sigma (3-sigma)
• Kolmogorov-Smirnov (KS)
• Interquartile Range (IQR)
• Grubbs Test
• Least Squares
• Etc.
Anomaly Detection
Detection Workflow - Ensemble Approach
Diagram: Metric → Prediction → Residual → IQR / 3-sigma / KS → Combine Votes
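The ensemble workflow can be sketched roughly as follows: compute residuals, let each technique vote, and flag an anomaly when enough votes agree. This is an illustrative sketch, not the presenters' implementation. The function names, the five-sample recent window, the two-vote majority, and the approximate KS critical coefficient of 1.36 (5% significance) are all assumptions.

```python
import statistics
from bisect import bisect_right


def three_sigma_vote(reference, recent):
    """Vote if the latest residual is more than 3 standard deviations from the mean."""
    mean = statistics.fmean(reference)
    sd = statistics.stdev(reference)
    return abs(recent[-1] - mean) > 3 * sd


def iqr_vote(reference, recent, k=1.5):
    """Vote if the latest residual falls outside the k * IQR fences."""
    q1, _, q3 = statistics.quantiles(reference, n=4)
    spread = q3 - q1
    return not (q1 - k * spread <= recent[-1] <= q3 + k * spread)


def ks_vote(reference, recent, c_alpha=1.36):
    """Vote if the two-sample KS statistic exceeds an approximate critical value."""
    a, b = sorted(reference), sorted(recent)
    d = max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
            for x in a + b)
    critical = c_alpha * ((len(a) + len(b)) / (len(a) * len(b))) ** 0.5
    return d > critical


def is_anomalous(actual, predicted, window=5, min_votes=2):
    """Return (decision, individual votes) for the most recent window."""
    residuals = [x - p for x, p in zip(actual, predicted)]
    reference, recent = residuals[:-window], residuals[-window:]
    votes = [vote(reference, recent)
             for vote in (three_sigma_vote, iqr_vote, ks_vote)]
    return sum(votes) >= min_votes, votes


if __name__ == "__main__":
    predicted = [100] * 25
    noise = [1, -1, 0, 2, -2]
    actual = [100 + n for n in noise * 4] + [108, 109, 110, 109, 112]
    print(is_anomalous(actual, predicted))  # (True, [True, True, True])
```

Requiring agreement between techniques trades some sensitivity for fewer false alarms from any single test.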
Anomaly Detection
Advanced Detection Techniques
• Robust Anomaly Detection (RAD) - Netflix
• Seasonal Hybrid ESD - Twitter
• Extendible Generic Anomaly Detection System (EGADS) - Yahoo
• Kale - Etsy
out·li·er de·tec·tion [out-lahy-er] [dih-tek-shuhn]
1. identification of unusual members from a set of generating mechanisms.
2. not to be confused with anomaly detection.
Server Outlier Detection
Netflix runs on thousands of servers
• A small percentage of servers become unhealthy.
• Customer experience may be degraded.
• Time wasted looking for evidence.
Server Outlier Detection
Cluster Analysis
• Unsupervised machine learning.
• If a server belongs to a group, it should be near lots of other points as measured by some distance function.
Assumption
• Servers running the same hardware and software should behave similarly.
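The "near lots of other points" idea can be sketched as a simple density check: count how many peers fall within a distance `eps` of each server and flag servers with too few neighbors. This is a simplified, DBSCAN-inspired sketch rather than full DBSCAN. The server names, the normalized metric pairs, and the `eps` and `min_neighbors` values are hypothetical.

```python
import math


def find_outlier_servers(metrics_by_server, eps, min_neighbors):
    """Return servers with fewer than `min_neighbors` peers within distance `eps`."""
    outliers = []
    for name, point in metrics_by_server.items():
        neighbors = sum(
            1
            for other, other_point in metrics_by_server.items()
            if other != name and math.dist(point, other_point) <= eps
        )
        if neighbors < min_neighbors:
            outliers.append(name)
    return sorted(outliers)


if __name__ == "__main__":
    # Each point is (normalized CPU, normalized latency) for one server.
    cluster = {
        "i-aaa": (0.50, 0.52),
        "i-bbb": (0.52, 0.50),
        "i-ccc": (0.49, 0.51),
        "i-ddd": (0.51, 0.49),
        "i-eee": (0.95, 0.90),  # behaves unlike its peers
    }
    print(find_outlier_servers(cluster, eps=0.1, min_neighbors=2))  # ['i-eee']
```

Because the method is unsupervised, it needs no labeled examples of unhealthy servers, only the assumption that peers behave alike.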
Server Outlier Detection
Actions / Remediation
• Send e-mail
• Page service owner
• Terminate instance
• Remove from service
• Detach from a load balancer
Automated Canary Analysis
Canary Release Process
• A change is gradually rolled out to production.
• Checkpoints are performed along the way.
• A decision is made at each checkpoint.
Automated Canary Analysis
Advantages
• Better degree of trust and safety in deployments.
• Faster deployment cadence.
• Lower investment in simulation engineering.
Automated Canary Analysis
Canary Process
Diagram: Traffic → Load Balancer; 95% → Current Version (v1.0), 100 Servers; 5% → New Version (v1.1), 5 Servers; Metrics
Automated Canary Analysis
Canary Process
Diagram: Traffic → Load Balancer; Current Version (v1.0), 0 Servers; 100% → New Version (v1.1), 100 Servers; Metrics
Automated Canary Analysis
Automated Analysis
• Identify a set of metrics to compare.
• Use a statistical test to identify the difference between v1.0 and v1.1.
• Mann–Whitney
• Kolmogorov-Smirnov
• Generate a score that indicates overall similarity.
• Percentage of metrics that match in performance.
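The scoring step can be sketched as: run a Mann–Whitney test per metric, count the metrics where control and canary are not significantly different, and report that as a percentage. This is a rough sketch using the normal approximation without tie or continuity corrections. The metric names, sample values, and the 0.05 significance level are assumptions.

```python
import math
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_p(a, b):
    """Two-sided p-value from the normal approximation of the U statistic."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def canary_score(control, canary, alpha=0.05):
    """Percentage of metrics where canary and control match in performance."""
    matching = [name for name in control
                if mann_whitney_p(control[name], canary[name]) >= alpha]
    return 100 * len(matching) / len(control)


if __name__ == "__main__":
    control = {
        "latency_ms": [100, 102, 98, 101, 99, 103, 97, 100],
        "error_rate": [1, 2, 1, 2, 1, 2, 1, 2],
        "cpu": [50, 52, 49, 51, 50, 53, 48, 51],
    }
    canary = {
        "latency_ms": [101, 99, 100, 102, 98, 100, 103, 97],
        "error_rate": [9, 8, 9, 10, 9, 8, 10, 9],  # clearly worse
        "cpu": [51, 50, 52, 49, 50, 51, 48, 53],
    }
    print(round(canary_score(control, canary), 1))  # 66.7
```

At each checkpoint the score would be compared with a threshold chosen by the service owner to decide whether the rollout continues.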
Automated Canary Analysis
Augmented Canary Process
Diagram: Traffic → Load Balancer; Previous Version (v1.0), 88 Servers; New Version (Canary - v1.1), 6 Servers; Previous Version (Control - v1.0), 6 Servers; Metrics → Analysis
Correlation Analysis
Automated Finger-Pointing for Fun and Profit
You Want Service-Oriented Architecture?
We’ve got Service-Oriented Architecture
Correlation Analysis
Something Else Is Also Weird!
Diagram: CPU up (alert triggered); HTTP 400 (correlated spike); HTTP requests (correlated drop)
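A minimal sketch of the idea: when one metric alerts, compute the Pearson correlation between it and other metrics over the same time window and surface the ones that move with it (or against it). The metric names, sample values, and the 0.8 cut-off are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / math.sqrt(var_x * var_y)


def correlated_metrics(alerting, candidates, threshold=0.8):
    """Return candidate metrics whose |r| with the alerting metric meets the threshold."""
    found = {}
    for name, series in candidates.items():
        r = pearson(alerting, series)
        if abs(r) >= threshold:
            found[name] = round(r, 3)
    return found


if __name__ == "__main__":
    cpu = [20, 21, 20, 22, 60, 65, 70, 68]  # the metric that alerted
    others = {
        "http_400": [1, 1, 2, 1, 30, 33, 35, 34],                   # spikes with CPU
        "http_requests": [500, 510, 495, 505, 200, 180, 150, 160],  # drops as CPU rises
        "disk_io": [5, 6, 5, 6, 5, 6, 5, 6],                        # unrelated
    }
    print(sorted(correlated_metrics(cpu, others)))  # ['http_400', 'http_requests']
```

Strong negative correlations matter as much as positive ones here, which is why the check uses the absolute value of r.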
Correlation Analysis
If you care about this, you don’t care about that …
Diagram: "I Care About This Metric!"; "I Also Care About This Metric!"; "Maybe not?"
In Conclusion
Diagram: (Reasonable) Telemetry System; Real-Time Analytics System(s); Orchestration Systems. Flow labels: Data, Decision, Observation.
Useful Links
• Prediction
• Predictive Auto Scaling: http://techblog.netflix.com/2013/12/scryer-netflixs-predictive-auto-scaling.htm
• FFT: https://en.wikipedia.org/wiki/Fast_Fourier_transform
• Detection
• Double Exponential Smoothing: https://en.wikipedia.org/wiki/Exponential_smoothing
• Interquartile Range (IQR): https://en.wikipedia.org/wiki/Interquartile_range
• Ensemble Learning: http://www.scholarpedia.org/article/Ensemble_learning
• Robust Anomaly Detection (RAD): http://techblog.netflix.com/2015/02/rad-outlier-detection-on-big-data.html
• DBSCAN: https://en.wikipedia.org/wiki/DBSCAN
• Server Outlier Detection: http://techblog.netflix.com/2015/07/tracking-down-villains-outlier.html
• Canary Release Process: http://martinfowler.com/bliki/CanaryRelease.html
• Automated Canary Analysis: http://www.infoq.com/presentations/canary-analysis-deployment-pattern
• Nonparametric tests: https://en.wikipedia.org/wiki/Nonparametric_statistics
• Correlation
• Pearson Correlation: https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
Attributions
• http://aggronaut.com
• http://designsold.com/pictures-of-kittens/
• http://slate.com, Illustration by Phil Plait
• http://www-rohan.sdsu.edu/
• http://scikit-learn.org/stable/documentation.html