Chaos Patterns - Twilio SignalConf 2016

Upload: bruce-wong

Post on 14-Apr-2017

TRANSCRIPT

CHAOS PATTERNS

BRUCE M. WONG | @BRUCE_M_WONG

LESSONS ABOUT FAILING WELL AND FAILING OFTEN

FAILURE HAPPENS

“EVERYTHING FAILS ALL THE TIME” - WERNER VOGELS, CTO, AMAZON WEB SERVICES

HTTP://THENEXTWEB.COM/2008/04/04/WERNER-VOGELS-EVERYTHING-FAILS-ALL-THE-TIME/

THE ORIGINAL CHAOS MONKEY

CREATED BY NETFLIX CLOUD ARCHITECT GREG ORZELL (@CHAOSSIMIA), 2010

HTTPS://WWW.LINKEDIN.COM/IN/GORZELL
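
The core idea behind Chaos Monkey, randomly terminating instances so that services must tolerate instance loss, can be sketched in a few lines. This is a minimal illustration and not Netflix's implementation: `pick_victims`, the group names, and the instance IDs are hypothetical, and the sketch only selects instances rather than terminating anything.

```python
import random

def pick_victims(groups, rng, probability=1.0):
    """Pick at most one random instance per group as a termination candidate."""
    victims = {}
    for name, instances in groups.items():
        # Skip empty groups; otherwise select with the given probability.
        if instances and rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims

# Hypothetical inventory: group name -> instance IDs.
groups = {
    "api": ["i-a1", "i-a2", "i-a3"],
    "worker": ["i-w1", "i-w2"],
    "empty": [],
}

victims = pick_victims(groups, random.Random(7))
print(victims)
```

Limiting the selection to one instance per group keeps each experiment small enough that a healthy group should absorb the loss.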

A STATE OF XEN

AWS EC2 REBOOT, 2014

HTTP://XENBITS.XEN.ORG/XSA/ADVISORY-108.HTML

HTTP://TECHBLOG.NETFLIX.COM/2014/10/A-STATE-OF-XEN-CHAOS-MONKEY-CASSANDRA.HTML

HTTP://AWS.AMAZON.COM/BLOGS/AWS/EC2-MAINTENANCE-UPDATE/

22 COMPLETE NODE FAILURES

2700+ C* NODES, 218 REBOOTS

0 DOWNTIME

LESSON # 1 : TRUST YOUR RESILIENCE

SLOW IS HARD

UNBOUNDED QUEUES - ELASTIC ISN’T INFINITE
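
One common way to act on this point is to bound the queue and shed work when it is full, so that a slow consumer produces visible backpressure instead of unlimited memory growth. The sketch below is illustrative only: `submit` and `MAX_PENDING` are hypothetical names, and the bound of 3 is arbitrary.

```python
import queue

MAX_PENDING = 3  # hypothetical bound; size it to what downstream can drain

pending = queue.Queue(maxsize=MAX_PENDING)

def submit(job):
    """Enqueue a job, or reject it immediately when the queue is full."""
    try:
        pending.put_nowait(job)
        return True
    except queue.Full:
        # Shed load: the caller learns about the backlog right away.
        return False

# Nothing is consuming, so only the first MAX_PENDING jobs are accepted.
results = [submit(n) for n in range(5)]
print(results)  # [True, True, True, False, False]
```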

LATENCY MONKEY

LATENCY TESTING 2.0 - FIT

HTTP://TECHBLOG.NETFLIX.COM/2014/10/FIT-FAILURE-INJECTION-TESTING.HTML

START SLOW

•ACCOUNT LEVEL

•+10MS BEFORE +100MS

•+1% ERRORS BEFORE +80% ERRORS

DIAL IT UP

•A -> D NOT * -> D
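
The ramp on this slide can be sketched as a list of stages, each scoped to one caller, one dependency, and one account, with the fault getting stronger at each step. This is a hedged illustration and not the FIT API: `Stage`, `make_stage`, `plan_fault`, and the `test-account` name are all hypothetical, and the sketch only decides what fault to apply instead of sleeping or raising.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    caller: str          # only this caller is affected ("A -> D", not "* -> D")
    target: str          # the dependency being degraded
    accounts: frozenset  # account-level scope
    delay_ms: int        # extra latency to add
    error_rate: float    # fraction of in-scope requests to fail

def make_stage(delay_ms, error_rate):
    """Build a stage scoped to caller A, dependency D, and one test account."""
    return Stage("A", "D", frozenset({"test-account"}), delay_ms, error_rate)

# Mildest fault first: +10ms before +100ms, 1% errors before 80% errors.
RAMP = [make_stage(10, 0.01), make_stage(100, 0.01), make_stage(100, 0.80)]

def plan_fault(stage, caller, target, account, rng):
    """Return (delay_ms, should_fail) for one request under the given stage."""
    in_scope = (
        caller == stage.caller
        and target == stage.target
        and account in stage.accounts
    )
    if not in_scope:
        return 0, False  # out-of-scope traffic is left untouched
    return stage.delay_ms, rng.random() < stage.error_rate

rng = random.Random(42)
for stage in RAMP:
    plans = [plan_fault(stage, "A", "D", "test-account", rng) for _ in range(1000)]
    failed = sum(1 for _, fail in plans if fail)
    print(f"+{stage.delay_ms}ms, target {stage.error_rate:.0%} errors: "
          f"{failed} of 1000 requests marked to fail")
```

Keeping the scope check separate from the fault strength makes it possible to widen the blast radius and raise the severity independently.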

LESSON # 2 : FIXING ONE FAILURE MODE EXPOSES NEW ONES

WHAT’S SO SPECIAL ABOUT CHAOS

CHAOS IS A CHOICE

OUTAGES VS CHAOS

OUTAGES | CHAOS
Uncontrolled | Controlled
Unpredictable | Scheduled
Time to Detect: Minutes | Time to Detect: 0
Time to Resolve: ???? | Time to Resolve: seconds*
Analysis Time: ???? | Root Cause Analysis: Intentional

MYTH OF RESILIENCE

NATION’S BUSINESS, 1977

LATENCY MONKEY

LESSON # 3 : THE CULTURE ASPECTS OF CHAOS ARE HARD

MOST ENTERPRISES HIRE PEOPLE TO FIX THINGS. NETFLIX HIRES PEOPLE TO BREAK THINGS….

…WE SHOULD EMBRACE NETFLIX'S CULTURE OF "CHAOS ENGINEERING" THROUGHOUT ORGANIZATIONS OF ALL SHAPES AND SIZES.

SEEK PROGRESS OVER PERFECTION

TWILIO LEADERSHIP PRINCIPLE

GAME DAYS - BENEFITS

•Training New Engineers

•Discover Instrumentation gaps

•New Product Launches

•Incident Management Practices

GAME DAYS - THE SETUP

•Two “on-call” teams

•Separate rooms, separate slack channels

•Master of Disaster

•Incident Commander

LEVERAGE EXISTING TESTBOTS

•Functionally test fallback code

•Early warning!

•Existing Integrations with Telemetry, PagerDuty, Slack

•Incorporate into Canary process (FUTURE)
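
Functionally testing fallback code means forcing the dependency to fail and checking that the fallback path returns something usable. The sketch below assumes a simple "return a static default on error" fallback. `get_recommendations`, the two stand-in dependencies, and the default value are hypothetical and not taken from the talk.

```python
def get_recommendations(user_id, fetch_personalized, default=("top-10",)):
    """Return personalized results, falling back to a static default on failure."""
    try:
        return list(fetch_personalized(user_id))
    except Exception:
        # Fallback path: this is the code a testbot should exercise on purpose.
        return list(default)

def healthy_dependency(user_id):
    """Stand-in for a working downstream call."""
    return [f"pick-for-{user_id}"]

def broken_dependency(user_id):
    """Stand-in for an injected failure."""
    raise TimeoutError("injected failure")

print(get_recommendations("u1", healthy_dependency))  # ['pick-for-u1']
print(get_recommendations("u1", broken_dependency))   # ['top-10']
```

Running the failing case on a schedule gives an early warning when a change silently breaks the fallback.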

RECAP

Lesson # 1 : Trust your resilience

Lesson # 2 : Fixing one failure mode exposes new ones

Lesson # 3 : The culture aspects of Chaos are HARD

Get started today!

Game Days are your friend - do them early and often

Testbots + focus on developer productivity

WHEN YOU WISH UPON A BLUE MOON
