Chaos Patterns: Twilio SignalConf 2016
TRANSCRIPT
“EVERYTHING FAILS ALL THE TIME” - WERNER VOGELS, CTO, AMAZON WEB SERVICES
HTTP://THENEXTWEB.COM/2008/04/04/WERNER-VOGELS-EVERYTHING-FAILS-ALL-THE-TIME/
BRUCE M. WONG | @BRUCE_M_WONG
THE ORIGINAL CHAOS MONKEY
CREATED IN 2010 BY NETFLIX CLOUD ARCHITECT GREG ORZELL (@CHAOSSIMIA)
HTTPS://WWW.LINKEDIN.COM/IN/GORZELL
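As a rough illustration of the idea behind Chaos Monkey (terminating instances at random so that services have to tolerate instance loss), a minimal sketch follows. It is not Netflix's implementation: `Instance`, `pick_victim`, `terminate`, and `run_once` are hypothetical names, and termination is simulated in memory rather than done through any cloud API.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a running server in a service group."""
    instance_id: str
    group: str
    alive: bool = True


def pick_victim(instances, group, rng):
    """Choose one live instance from the given group at random, or None."""
    candidates = [i for i in instances if i.group == group and i.alive]
    return rng.choice(candidates) if candidates else None


def terminate(instance):
    """Simulated termination; a real tool would call the cloud provider here."""
    instance.alive = False


def run_once(instances, group, rng=None):
    """Terminate at most one random instance in the group and return it."""
    if rng is None:
        rng = random.Random()
    victim = pick_victim(instances, group, rng)
    if victim is not None:
        terminate(victim)
    return victim


# Three "api" instances and one "db" instance; only the "api" group is targeted.
fleet = [Instance(f"i-{n}", "api") for n in range(3)] + [Instance("i-db", "db")]
killed = run_once(fleet, "api", random.Random(42))
print(killed.instance_id, "terminated;", sum(i.alive for i in fleet), "instances still alive")
```

Limiting each run to one instance in one group keeps the blast radius small, which is what makes it reasonable to run this kind of experiment routinely.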
HTTP://XENBITS.XEN.ORG/XSA/ADVISORY-108.HTML
HTTP://TECHBLOG.NETFLIX.COM/2014/10/A-STATE-OF-XEN-CHAOS-MONKEY-CASSANDRA.HTML
HTTP://AWS.AMAZON.COM/BLOGS/AWS/EC2-MAINTENANCE-UPDATE/
22 COMPLETE NODE FAILURES
2700+ C* NODES, 218 REBOOTS
0 DOWNTIME
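To put these numbers in proportion, a quick calculation helps. It assumes that the 22 failed nodes are a subset of the 218 rebooted ones, and it treats "2700+" as exactly 2700, so the fleet share is an upper bound.

```python
total_nodes = 2700   # the slide says "2700+", so this is a lower bound on fleet size
rebooted = 218
failed = 22          # assumed to be a subset of the rebooted nodes

share_rebooted = rebooted / total_nodes   # upper bound, since the fleet is 2700+
share_failed = failed / rebooted

print(f"rebooted: at most {share_rebooted:.1%} of the fleet")
print(f"failed:   {share_failed:.1%} of the rebooted nodes")
```

Under those assumptions, roughly 8% of the Cassandra fleet was rebooted and about one in ten of those nodes failed completely, with no downtime.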
LATENCY TESTING 2.0 - FIT
HTTP://TECHBLOG.NETFLIX.COM/2014/10/FIT-FAILURE-INJECTION-TESTING.HTML
SLOW IS HARD
START SLOW
•ACCOUNT LEVEL
•+10MS BEFORE +100MS
•+1% ERRORS BEFORE +80% ERRORS
DIAL IT UP
•A -> D NOT * -> D
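The "start slow, dial it up" advice can be sketched as a staged injection plan. This is only an illustration, not how FIT is implemented: `FaultStage`, `RAMP`, `in_scope`, and `inject` are hypothetical names, the account name is made up, and reading "A -> D NOT * -> D" as "inject on one caller-to-dependency edge rather than on every caller of D" is an interpretation of the slide.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FaultStage:
    """One step of a hypothetical failure-injection ramp."""
    added_latency_ms: int
    error_rate: float          # fraction of matching calls to fail, 0.0 to 1.0
    account: Optional[str]     # None means "all accounts"


# One possible ramp: a single account and small doses first, then larger ones.
RAMP = [
    FaultStage(added_latency_ms=10, error_rate=0.01, account="test-account"),
    FaultStage(added_latency_ms=100, error_rate=0.01, account="test-account"),
    FaultStage(added_latency_ms=100, error_rate=0.80, account="test-account"),
    FaultStage(added_latency_ms=100, error_rate=0.80, account=None),
]


def in_scope(stage, caller, dependency, account, target=("A", "D")):
    """Inject only on the targeted caller -> dependency edge (A -> D, not * -> D)."""
    if (caller, dependency) != target:
        return False
    return stage.account is None or stage.account == account


def inject(stage, caller, dependency, account, rng):
    """Return (extra_latency_ms, should_fail) for a single call."""
    if not in_scope(stage, caller, dependency, account):
        return 0, False
    return stage.added_latency_ms, rng.random() < stage.error_rate


rng = random.Random(0)
first = RAMP[0]
print(inject(first, "A", "D", "test-account", rng))   # in scope: +10 ms, rarely fails
print(inject(first, "B", "D", "test-account", rng))   # other caller: untouched
print(inject(first, "A", "D", "other-account", rng))  # other account: untouched
```

Moving to the next stage only after the previous one has been observed to be safe is what keeps each experiment small enough to stop within seconds.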
OUTAGES VS CHAOS
OUTAGES                      CHAOS
Uncontrolled                 Controlled
Unpredictable                Scheduled
Time to Detect: Minutes      Time to Detect: 0
Time to Resolve: ????        Time to Resolve: seconds*
Analysis Time: ????          Root Cause Analysis: Intentional
MOST ENTERPRISES HIRE PEOPLE TO FIX THINGS. NETFLIX HIRES PEOPLE TO BREAK THINGS….
…WE SHOULD EMBRACE NETFLIX'S CULTURE OF "CHAOS ENGINEERING" THROUGHOUT ORGANIZATIONS OF ALL SHAPES AND SIZES.
GAME DAYS - BENEFITS
•Training New Engineers
•Discover Instrumentation gaps
•New Product Launches
•Incident Management Practices
GAME DAYS - THE SETUP
•Two “on-call” teams
•Separate rooms, separate slack channels
•Master of Disaster
•Incident Commander
LEVERAGE EXISTING TESTBOTS
•Functionally test fallback code
•Early warning!
•Existing Integrations with Telemetry, PagerDuty, Slack
•Incorporate into Canary process
FUTURE
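A testbot that functionally tests fallback code could look something like the sketch below. Everything in it is hypothetical (the service, the fallback list, and the names `fetch_personalized`, `get_titles`, and `testbot_check_fallback`); the injected fault is simulated with a flag rather than a real failure-injection hook.

```python
FALLBACK_TITLES = ["popular-1", "popular-2"]   # hypothetical static fallback


class DependencyError(Exception):
    """Raised by the (simulated) personalization dependency."""


def fetch_personalized(user_id, fail=False):
    """Simulated downstream call; `fail` stands in for an injected fault."""
    if fail:
        raise DependencyError("injected failure")
    return [f"pick-for-{user_id}"]


def get_titles(user_id, inject_failure=False):
    """Return personalized titles, or the fallback list if the dependency fails."""
    try:
        return fetch_personalized(user_id, fail=inject_failure)
    except DependencyError:
        return FALLBACK_TITLES


def testbot_check_fallback(user_id="testbot-user"):
    """Exercise the fallback path on purpose and report whether it held up."""
    normal = get_titles(user_id)
    degraded = get_titles(user_id, inject_failure=True)
    return {
        "normal_ok": normal == [f"pick-for-{user_id}"],
        "fallback_ok": degraded == FALLBACK_TITLES,
    }


result = testbot_check_fallback()
print(result)
```

Because the check returns a plain dictionary, its result could be forwarded to whatever telemetry or paging integration the existing testbots already use.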
RECAP
Lesson #1: Trust your resilience
Lesson #2: Fixing one failure mode exposes new ones
Lesson #3: The culture aspects of Chaos are HARD
Get started today!
Game Days are your friend - do them early and often
Testbots + focus on developer productivity