chaos patterns

CHAOS PATTERNS Architecting for failure in distributed systems Bruce Wong - @bruce_m_wong / Jos Boumans - @jiboumans http://www.soponderando.com.br/

Upload: bruce-wong

Post on 14-Apr-2017


TRANSCRIPT

Page 1: Chaos Patterns

CHAOS PATTERNS

Architecting for failure in distributed systems

Bruce Wong - @bruce_m_wong / Jos Boumans - @jiboumans

http://www.soponderando.com.br/

Page 2: Chaos Patterns

http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg

How to measure everything

http://www.slideshare.net/jiboumans/how-to-measure-everything-a-million-metrics-per-second-with-minimal-developer-overhead

Architecting in AWS for resilience & cost

www.slideshare.net/jiboumans/aws-architecting-for-resilience-cost-at-scale

Page 3: Chaos Patterns

VP of Operations & Infrastructure

http://www.krux.com/

3 Billion Users

Page 4: Chaos Patterns

ABOUT BRUCE

2010 2015

Software Engineer

Insight Engineering

Senior Engineering Manager

Chaos Engineering

Prosumers Consumers Enterprise

http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

Page 5: Chaos Patterns

A LOT OF TRAFFIC

http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

Page 6: Chaos Patterns

http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/

REAL WORLD FAILURES

Page 7: Chaos Patterns

SEPTEMBER 20TH, 2015

Also: April 21, 2011 - June 29, 2012 - October 22, 2012 - December 24, 2012 - August 26, 2013 <out of space>

https://twitter.com/iamDeveloper/status/645659734767329281 https://aws.amazon.com/message/5467D2/

Page 8: Chaos Patterns

ISOLATION & CONTAINMENT

Ideally limit failure to a single service

Stop it from spreading

http://businessnerds.wordpress.com/2011/05/28/so-far-so-good…-the-review/

Page 9: Chaos Patterns

Amazon Cloud Major Outage - Issues Categories (# of issues)

Software, 8

Automation, 4

Process, 14

https://steamcommunity.com/app/620/ http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg

AWS Root Cause Analysis over time

http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis

Page 10: Chaos Patterns

Humans, Software, Processes

All likely causes of failure

Isolation Unlikely

2 - 4x Yearly frequency of catastrophic failure

Page 11: Chaos Patterns

THERE ARE DOWNSIDES

http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines

Page 12: Chaos Patterns

Complex Systems

Difficult to model, not feasible to simulate at scale

Page 13: Chaos Patterns

Software is Iterative

testing, code coverage, “agile”

Page 14: Chaos Patterns

Resilience Design is also Iterative

…unlike software, complexity makes testing difficult

Page 15: Chaos Patterns

Twitter

Page 16: Chaos Patterns

Rich Search Experience

Many optional enhancements

Page 17: Chaos Patterns

http://usa.streetsblog.org/category/issues-campaigns/air-quality/

NAVIGATING THE CHAOS

Page 18: Chaos Patterns

FALLBACK PATTERNS

“Expect the Unexpected”

http://blabitcanada.com/category/twitter-2/

Page 19: Chaos Patterns

BASIC API CALL

3 potential points of failure

Page 20: Chaos Patterns

FALLBACK PATTERNS

The cost of resilience should be accuracy or latency

http://redis.io/ http://memcached.org/

http://varnish-cache.org/
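A minimal sketch of the trade this slide describes, paying for resilience with accuracy: when the primary call fails, answer from the last known good value. The class name, the `fetch` callable, and the in-process dict standing in for a cache such as Redis, Memcached, or Varnish are all assumptions for illustration, not something the slides specify.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class StaleCacheFallback:
    """Serve the last known good value when the primary call fails."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # hypothetical primary data source
        # key -> (value, time stored); a plain dict stands in for a real cache
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[Optional[str], bool]:
        """Return (value, is_fresh); is_fresh is False for a fallback answer."""
        try:
            value = self._fetch(key)
        except Exception:
            cached = self._cache.get(key)
            if cached is None:
                return None, False  # nothing to fall back to
            return cached[0], False  # stale but available: accuracy is the cost
        self._cache[key] = (value, time.time())
        return value, True
```

Returning the freshness flag lets the caller decide whether a stale answer is acceptable for that piece of the page.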

Page 21: Chaos Patterns

ENSURING DATA ACCESS

https://www.flickr.com/photos/ichijo2009/8501266124

Page 22: Chaos Patterns

CAP THEOREM APPLIES

Your choice: sacrifice availability or consistency. Orange is a lie.

RDBMS BigTable Based

Master / Slave based

CouchDB Dynamo Based

http://ferd.ca/beating-the-cap-theorem-checklist.html

Page 23: Chaos Patterns

SPLIT OUT YOUR CONTROL PLANE

http://paul-barford.blogspot.com/2015/01/sappho-pap-obbink-further-painting-into.html

Page 24: Chaos Patterns

EC2 EMR RDS

Dynamo

Cloudfront CDN

Route53 DNS

Cloudwatch Monitoring

Page 25: Chaos Patterns

Cloudfront CDN

Route53 DNS

Cloudwatch Monitoring

Page 26: Chaos Patterns

Control plane Separate from workload

DNS & CDN Your best friends

Latency or Accuracy Pick one to sacrifice for resilience

Page 27: Chaos Patterns

USER EXPERIENCE

My tweet got posted

Page 28: Chaos Patterns

http://mclaughlindrums.com/wp-content/uploads/2013/04/Relativity-by-Escher.jpg

ORDERED CHAOS

Page 29: Chaos Patterns

Nation’s Business, 1977

Page 30: Chaos Patterns

CHAOS DEFINED

Intentionally introducing failure into a system with the purpose of validating resilience design.

Page 31: Chaos Patterns

http://www.cnbc.com/id/102394893

Page 32: Chaos Patterns

BREAKING THE SYSTEM

How Confident are you?

-Next week?

-Next month?

-After that “quick patch”

Page 33: Chaos Patterns

CHAOS VS OUTAGE

Chaos

• Controlled

• Planned

• Intentional

• Microscopic user impact

Outages

• Uncontrolled

• Unpredictable

• Unintended

• Large impact

Page 34: Chaos Patterns

Single Point of Failure

Discover - Fix - Validate

Page 35: Chaos Patterns

CHAOS MONKEY

http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

https://github.com/Netflix/SimianArmy

Page 36: Chaos Patterns

9am-5pm Mon-Fri Don’t upset your on-call

1 Instance Per group / per day

Detect SPOF Intentionally
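The scheduling rules on this slide can be sketched as a small selection function. This is an illustration only, not the SimianArmy implementation: `pick_victims`, the group dictionary, and the instance IDs are hypothetical, and the caller is assumed to run it once per day.

```python
import random
from datetime import datetime
from typing import Dict, List


def in_business_hours(now: datetime) -> bool:
    """True Monday to Friday between 09:00 and 17:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victims(groups: Dict[str, List[str]], now: datetime,
                 rng: random.Random) -> Dict[str, str]:
    """Pick at most one instance per group, and only during business hours."""
    if not in_business_hours(now):
        return {}  # never surprise the on-call outside office hours
    return {name: rng.choice(instances)
            for name, instances in groups.items() if instances}
```

Terminating the chosen instances is left out on purpose; the point is that the blast radius (one instance per group) and the timing are decided before anything is broken.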

Page 37: Chaos Patterns

Slow is Hard

Product + Business + Engineering Decisions

https://pragprog.com/book/mnee/release-it

Page 38: Chaos Patterns

Custom Fallback accuracy or latency

Fail Silent For optional data

Fail Fast to keep servers healthy
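The three strategies on this slide can be sketched as thin wrappers around a call to a dependency. The function names and the zero-argument `call` convention are assumptions made for the example; the slides name only the strategies.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fail_fast(call: Callable[[], T]) -> T:
    """Let the error surface immediately: no retries, no waiting."""
    return call()


def fail_silent(call: Callable[[], T]) -> Optional[T]:
    """For optional data: swallow the error and return nothing."""
    try:
        return call()
    except Exception:
        return None


def custom_fallback(call: Callable[[], T], fallback: Callable[[], T]) -> T:
    """On failure, answer with a cheaper or less accurate substitute."""
    try:
        return call()
    except Exception:
        return fallback()
```

Which wrapper applies to which dependency is the product, business, and engineering decision: an optional enhancement can fail silent, while a core call needs either a fallback or a fast, visible error.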

Page 39: Chaos Patterns

LATENCY MONKEY

other frameworks

http://www.infoq.com/presentations/failure-as-a-service-netflix

http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html

Page 40: Chaos Patterns

HTTP 5xx 1 minute duration

10-100ms Sleep during request

1-100% Of requests
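A rough sketch of how the three knobs above could combine, reading them as: for a one-minute experiment, a configurable share of requests either gets an HTTP 5xx or a 10-100 ms delay. This is not the Latency Monkey or FIT code; `inject_faults`, the handler signature, and the choice of status 503 are assumptions for illustration.

```python
import random
import time
from typing import Callable, Tuple

Response = Tuple[int, str]  # (HTTP status, body)


def inject_faults(handler: Callable[[str], Response], rate: float, mode: str,
                  rng: random.Random, duration_s: float = 60.0,
                  clock: Callable[[], float] = time.monotonic,
                  sleep: Callable[[float], None] = time.sleep
                  ) -> Callable[[str], Response]:
    """Wrap a handler so a fraction of requests fails or slows down.

    rate: share of requests affected, 0.01 to 1.0 (1-100%).
    mode: "error" returns a 5xx, "delay" sleeps 10-100 ms first.
    """
    stop_at = clock() + duration_s  # the experiment ends by itself

    def wrapped(request: str) -> Response:
        if clock() < stop_at and rng.random() < rate:
            if mode == "error":
                return 503, "injected failure"
            sleep(rng.uniform(0.010, 0.100))
        return handler(request)

    return wrapped
```

The built-in expiry keeps the experiment controlled: once the duration has passed, the wrapper behaves exactly like the original handler.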

Page 41: Chaos Patterns

Prevent Propagation

to avoid cascading failure
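One common way to stop a failing dependency from dragging its callers down is a circuit breaker. The slide does not name that pattern, so treat the sketch below as one possible technique; the class, thresholds, and timings are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised instead of calling a dependency that is marked unhealthy."""


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("dependency marked unhealthy")
            self._opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open: stop calling for a while
            raise
        self._failures = 0  # a success closes the circuit again
        return result
```

While the circuit is open the caller gets an immediate error it can route to a fallback, so threads and connections are not tied up waiting on a dependency that is already down.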

Page 42: Chaos Patterns

CHAOS KONG

because regions fail

http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html

Page 43: Chaos Patterns

GeoDNS fallback to LatencyDNS

Proxy Cross-Region communication

Capacity Cost-Benefit Decision
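One reading of "GeoDNS fallback to LatencyDNS" is: route a client to its geographically mapped region while that region is healthy, otherwise to the healthy region with the lowest measured latency. The sketch below illustrates that reading only; the function, the region names, and the latency figures are hypothetical, and real DNS services implement this in their own routing policies.

```python
from typing import Dict, Optional, Set


def choose_region(client_geo: str, geo_map: Dict[str, str],
                  latency_ms: Dict[str, float],
                  healthy: Set[str]) -> Optional[str]:
    """Prefer the geo-mapped region; fall back to the fastest healthy one."""
    preferred = geo_map.get(client_geo)
    if preferred in healthy:
        return preferred
    candidates = [region for region in healthy if region in latency_ms]
    if not candidates:
        return None  # no healthy region left to send traffic to
    return min(candidates, key=lambda region: latency_ms[region])
```

Whether the fallback region can absorb the extra traffic is the capacity question on this slide: spare headroom in every region costs money, and how much to keep is a cost-benefit decision.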

Page 44: Chaos Patterns

"ONCE IN A BLUE MOON"

Happens at least a few times a year...

https://whisperofangels.wordpress.com/2013/08/20/once-in-a-blue-moon/

Page 45: Chaos Patterns

TAKE AWAY

Go found chaos engineering at your company RIGHT NOW

Page 46: Chaos Patterns

Most enterprises hire people to fix things. Netflix hires people to break things….

…we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.

http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone

Page 47: Chaos Patterns

Q & A

http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html

@bruce_m_wong / @jiboumans

Slides - https://www.linkedin.com/in/brucemwong