twilio signal 2016 chaos patterns

Upload: twilio-inc

Post on 15-Apr-2017

TRANSCRIPT

Page 1: Twilio Signal 2016 Chaos Patterns

CHAOS PATTERNS

BRUCE M. WONG | @BRUCE_M_WONG

LESSONS ABOUT FAILING WELL AND FAILING OFTEN

Page 2: Twilio Signal 2016 Chaos Patterns

FAILURE HAPPENS

Page 3: Twilio Signal 2016 Chaos Patterns

“EVERYTHING FAILS ALL THE TIME” - WERNER VOGELS, CTO, AMAZON WEB SERVICES

HTTP://THENEXTWEB.COM/2008/04/04/WERNER-VOGELS-EVERYTHING-FAILS-ALL-THE-TIME/

Page 4: Twilio Signal 2016 Chaos Patterns

THE ORIGINAL CHAOS MONKEY

CREATED BY NETFLIX CLOUD ARCHITECT, GREG ORZELL - @CHAOSSIMIA 2010

HTTPS://WWW.LINKEDIN.COM/IN/GORZELL

Page 5: Twilio Signal 2016 Chaos Patterns

A STATE OF XEN

AWS EC2 REBOOT, 2014

Page 6: Twilio Signal 2016 Chaos Patterns

HTTP://XENBITS.XEN.ORG/XSA/ADVISORY-108.HTML

HTTP://TECHBLOG.NETFLIX.COM/2014/10/A-STATE-OF-XEN-CHAOS-MONKEY-CASSANDRA.HTML

HTTP://AWS.AMAZON.COM/BLOGS/AWS/EC2-MAINTENANCE-UPDATE/

22 COMPLETE NODE FAILURES

2700+ C* NODES, 218 REBOOTS

0 DOWNTIME

Page 7: Twilio Signal 2016 Chaos Patterns

LESSON #1 : TRUST YOUR RESILIENCE

Page 8: Twilio Signal 2016 Chaos Patterns

SLOW IS HARD

Page 9: Twilio Signal 2016 Chaos Patterns

SLOW IS HARD

Page 10: Twilio Signal 2016 Chaos Patterns

UNBOUND QUEUES - ELASTIC ISN’T INFINITE

Page 11: Twilio Signal 2016 Chaos Patterns

UNBOUND QUEUES - ELASTIC ISN’T INFINITE
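
One way to read "elastic isn't infinite": an unbounded queue in front of a slow dependency keeps growing until memory runs out, whereas a bounded queue makes the overload visible and lets the caller shed load. The following is a minimal sketch of that idea using Python's standard library. The class name `BoundedWorkQueue` and its methods are hypothetical and do not come from the talk.

```python
import queue


class BoundedWorkQueue:
    """Work queue with a hard capacity that rejects work instead of growing."""

    def __init__(self, capacity):
        # maxsize > 0 makes queue.Queue refuse items once it is full.
        self._q = queue.Queue(maxsize=capacity)
        self.rejected = 0  # count of shed items, useful as an alerting signal

    def offer(self, item):
        """Try to enqueue without blocking; return False if the queue is full."""
        try:
            self._q.put_nowait(item)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def poll(self):
        """Return the next item, or None if the queue is empty."""
        try:
            return self._q.get_nowait()
        except queue.Empty:
            return None

    def depth(self):
        """Approximate number of queued items."""
        return self._q.qsize()
```

A producer that gets `False` back from `offer` can fail fast or return a degraded response, rather than letting latency pile up behind a slow consumer.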

Page 12: Twilio Signal 2016 Chaos Patterns

SLOW IS HARD

Page 13: Twilio Signal 2016 Chaos Patterns

LATENCY MONKEY

Page 14: Twilio Signal 2016 Chaos Patterns

SLOW IS HARD

Page 15: Twilio Signal 2016 Chaos Patterns

LATENCY TESTING 2.0 - FIT

HTTP://TECHBLOG.NETFLIX.COM/2014/10/FIT-FAILURE-INJECTION-TESTING.HTML

Page 16: Twilio Signal 2016 Chaos Patterns

SLOW IS HARD

Page 17: Twilio Signal 2016 Chaos Patterns

SLOW IS HARD

START SLOW

•ACCOUNT LEVEL

•+10MS BEFORE +100MS

•+1% ERRORS BEFORE +80% ERRORS

DIAL IT UP

•A -> D NOT * -> D
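
The "start slow, dial it up" guidance can be sketched as a fault plan that is scoped to a single account and a single caller-to-dependency path, then ramped in small steps. This sketch reads "A -> D NOT * -> D" as "inject on the path from service A to dependency D, not on every caller of D", which is an interpretation of the slide. The names `FaultPlan`, `RAMP`, and `apply_fault` are hypothetical and are not the API of Latency Monkey or FIT.

```python
import random
import time
from dataclasses import dataclass


class InjectedError(Exception):
    """Raised in place of a real dependency error during an experiment."""


@dataclass(frozen=True)
class FaultPlan:
    account_id: str        # account-level scope: only this account is affected
    caller: str            # calling service, e.g. "A"
    dependency: str        # called service, e.g. "D"
    added_latency_ms: int  # extra delay added to matching calls
    error_rate: float      # fraction of matching calls that fail, 0.0 to 1.0


# Hypothetical ramp: small latency and error rates first, larger ones later.
RAMP = [
    FaultPlan("test-account", "A", "D", 10, 0.01),
    FaultPlan("test-account", "A", "D", 100, 0.01),
    FaultPlan("test-account", "A", "D", 100, 0.80),
]


def in_scope(plan, account_id, caller, dependency):
    """True only for the one account and the one A -> D path in the plan."""
    return (account_id, caller, dependency) == (
        plan.account_id, plan.caller, plan.dependency)


def apply_fault(plan, account_id, caller, dependency,
                rng=random.random, sleep=time.sleep):
    """Inject the planned fault; return the latency added in milliseconds."""
    if not in_scope(plan, account_id, caller, dependency):
        return 0  # everything outside the blast radius is untouched
    if rng() < plan.error_rate:
        raise InjectedError(f"injected error on {caller} -> {dependency}")
    sleep(plan.added_latency_ms / 1000.0)
    return plan.added_latency_ms
```

Passing `rng` and `sleep` as parameters keeps the injection deterministic in tests. An experiment would move to the next `RAMP` entry only after the previous step showed no unexpected impact.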

Page 18: Twilio Signal 2016 Chaos Patterns

LESSON # 2 : FIXING ONE FAILURE MODE EXPOSES NEW ONES

Page 19: Twilio Signal 2016 Chaos Patterns

WHAT’S SO SPECIAL ABOUT CHAOS?

CHAOS IS A CHOICE

Page 20: Twilio Signal 2016 Chaos Patterns

WHAT’S SO SPECIAL ABOUT CHAOS?

OUTAGES VS CHAOS

Page 21: Twilio Signal 2016 Chaos Patterns

OUTAGES VS CHAOS

OUTAGES                    CHAOS
Uncontrolled               Controlled
Unpredictable              Scheduled
Time to Detect: Minutes    Time to Detect: 0
Time to Resolve: ????      Time to Resolve: seconds*
Analysis Time: ????        Root Cause Analysis: Intentional

Page 22: Twilio Signal 2016 Chaos Patterns

MYTH OF RESILIENCE

NATION’S BUSINESS, 1977

Page 23: Twilio Signal 2016 Chaos Patterns

LATENCY MONKEY

Page 24: Twilio Signal 2016 Chaos Patterns

LESSON # 3 : THE CULTURE ASPECTS OF CHAOS ARE HARD

Page 25: Twilio Signal 2016 Chaos Patterns

MOST ENTERPRISES HIRE PEOPLE TO FIX THINGS. NETFLIX HIRES PEOPLE TO BREAK THINGS….

…WE SHOULD EMBRACE NETFLIX'S CULTURE OF "CHAOS ENGINEERING" THROUGHOUT ORGANIZATIONS OF ALL SHAPES AND SIZES.

Page 26: Twilio Signal 2016 Chaos Patterns

Page 27: Twilio Signal 2016 Chaos Patterns

SEEK PROGRESS OVER PERFECTION

TWILIO LEADERSHIP PRINCIPLE

Page 28: Twilio Signal 2016 Chaos Patterns

GAME DAYS - BENEFITS

•Training New Engineers

•Discover Instrumentation gaps

•New Product Launches

•Incident Management Practices

Page 29: Twilio Signal 2016 Chaos Patterns

GAME DAYS - THE SETUP

•Two “on-call” teams

•Separate rooms, separate slack channels

•Master of Disaster

•Incident Commander

Page 30: Twilio Signal 2016 Chaos Patterns

LEVERAGE EXISTING TESTBOTS

•Functionally test fallback code

•Early warning!

•Existing Integrations with Telemetry, PagerDuty, Slack

•Incorporate into Canary process (FUTURE)
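
"Functionally test fallback code" can be illustrated with a check that forces the primary path to fail and then asserts that the fallback actually answered. This is a minimal sketch only. The functions `call_with_fallback`, `broken_primary`, `degraded_default`, and `testbot_check` are hypothetical names, and the talk does not describe how Twilio's testbots are implemented.

```python
def call_with_fallback(primary, fallback):
    """Return (value, source); use the fallback only if the primary raises."""
    try:
        return primary(), "primary"
    except Exception:
        return fallback(), "fallback"


def broken_primary():
    """Stand-in for a dependency call with a failure injected."""
    raise ConnectionError("injected failure")


def degraded_default():
    """Stand-in for the degraded response served when the primary is down."""
    return {"status": "degraded"}


def testbot_check():
    """Force the primary to fail and verify that the fallback path answers."""
    value, source = call_with_fallback(broken_primary, degraded_default)
    return source == "fallback" and value == {"status": "degraded"}
```

Running a check like this on a schedule exercises fallback code that normal traffic rarely reaches, so a broken fallback shows up as a failed check rather than during a real outage.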

Page 31: Twilio Signal 2016 Chaos Patterns

RECAP

Lesson # 1 : Trust your resilience

Lesson # 2 : Fixing one failure mode exposes new ones

Lesson # 3 : The culture aspects of Chaos are HARD

Get started today!

Game Days are your friend - do them early and often

Testbots + focus on developer productivity

Page 32: Twilio Signal 2016 Chaos Patterns

WHEN YOU WISH UPON A BLUE MOON
