Chaos Patterns

Post on 14-Apr-2017

TRANSCRIPT

  • CHAOS PATTERNS

    Architecting for failure in distributed systems

    Bruce Wong - @bruce_m_wong / Jos Boumans - @jiboumans

    http://www.soponderando.com.br/

  • How to measure everything

    http://www.slideshare.net/jiboumans/how-to-measure-everything-a-million-metrics-per-second-with-minimal-developer-overhead

    Architecting in AWS for resilience & cost

    http://www.slideshare.net/jiboumans/aws-architecting-for-resilience-cost-at-scale

    http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg

  • VP of Operations & Infrastructure

    http://www.krux.com/

    3 Billion Users

    http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

  • ABOUT BRUCE

    2010 2015

    Software Engineer

    Insight Engineering

    Senior Engineering Manager

    Chaos Engineering

    Prosumers Consumers Enterprise

    http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html

  • A LOT OF TRAFFIC

    http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

  • REAL WORLD FAILURES

    http://grandprix247.com/2012/09/03/spa-pile-up-renews-focus-on-formula-1-safety-matters/

  • SEPTEMBER 20TH, 2015

    Also: April 21, 2011 - June 29, 2012 - October 22, 2012 - December 24, 2012 - August 26, 2013

    https://twitter.com/iamDeveloper/status/645659734767329281

    https://aws.amazon.com/message/5467D2/

  • ISOLATION & CONTAINMENT

    Ideally limit failure to a single service

    Stop it from spreading

    http://businessnerds.wordpress.com/2011/05/28/so-far-so-good-the-review/

    http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

  • AWS Root Cause Analysis over time

    Chart: Amazon Cloud Major Outage - Issues Categories (# of issues)

    Software: 8

    Automation: 4

    Process: 14

    http://www.slideshare.net/rahultyagi50999/amazon-cloud-major-outages-analysis

    https://steamcommunity.com/app/620/

    http://fotos.subefotos.com/7a6b3e6df9453d5adf150087e5300834o.jpg

  • Humans, Software, Processes: all likely causes of failure

    Isolation: unlikely

    2 - 4x: yearly frequency of catastrophic failure

  • THERE ARE DOWNSIDES

    http://modernsavage.hubpages.com/hub/10-springfield-shopper-headlines

    http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

  • Complex Systems

    Difficult to model, not feasible to simulate at scale

  • Software is Iterative

    Testing, code coverage, agile

  • Resilience Design is also Iterative

    Unlike software, complexity makes testing difficult

  • Twitter

  • Rich Search Experience

    Many optional enhancements

  • NAVIGATING THE CHAOS

    http://usa.streetsblog.org/category/issues-campaigns/air-quality/

  • FALLBACK PATTERNS

    Expect the Unexpected

    http://blabitcanada.com/category/twitter-2/

  • BASIC API CALL

    3 potential points of failure

  • FALLBACK PATTERNS

    The cost of resilience should be accuracy or latency

    http://redis.io/

    http://memcached.org/

    http://varnish-cache.org/
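
One way to read "the cost of resilience should be accuracy" is: when the primary call fails, serve the last known good value and accept that it may be out of date. The sketch below illustrates that idea under stated assumptions. The class name `StaleCacheFallback`, the `fetch` callable, and the in-process dict are all hypothetical; in practice the cache would more likely be one of the stores linked on the slide (Redis, Memcached, Varnish).

```python
import time
from typing import Callable, Dict, Tuple


class StaleCacheFallback:
    """Serve the last known good value when the primary call fails."""

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch  # primary data source (hypothetical callable)
        # key -> (value, time stored); a dict stands in for a real cache
        self._cache: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Tuple[str, bool]:
        """Return (value, is_stale). Raises only if no cached copy exists."""
        try:
            value = self._fetch(key)
        except Exception:
            if key in self._cache:
                # Trade accuracy for availability: old data beats no data.
                return self._cache[key][0], True
            raise
        self._cache[key] = (value, time.time())
        return value, False
```

Returning the `is_stale` flag lets the caller decide whether stale data is acceptable for a given feature, which is a product decision as much as an engineering one.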

  • ENSURING DATA ACCESS

    https://www.flickr.com/photos/ichijo2009/8501266124

  • CAP THEOREM APPLIES

    Your choice: sacrifice availability or consistency. Orange is a lie.

    RDBMS BigTable Based

    Master / Slave based

    CouchDB Dynamo Based

    http://ferd.ca/beating-the-cap-theorem-checklist.html

    http://www.americapictures.net/buenos-aires-traffic-city-night-argentina.html

  • SPLIT OUT YOUR CONTROL PLANE

    http://paul-barford.blogspot.com/2015/01/sappho-pap-obbink-further-painting-into.html

  • EC2 EMR RDS

    Dynamo

    Cloudfront CDN

    Route53 DNS

    Cloudwatch Monitoring

  • Cloudfront CDN

    Route53 DNS

    Cloudwatch Monitoring

  • Control plane: separate from workload

    DNS & CDN: your best friends

    Latency or Accuracy: pick one to sacrifice for resilience

  • USER EXPERIENCE

    My tweet got posted

  • ORDERED CHAOS

    http://mclaughlindrums.com/wp-content/uploads/2013/04/Relativity-by-Escher.jpg

  • Nation's Business, 1977

  • CHAOS DEFINED

    Intentionally introducing failure into a system with the purpose of validating resilience design.

  • http://www.cnbc.com/id/102394893

  • BREAKING THE SYSTEM

    How Confident are you?

    -Next week?

    -Next month?

    -After that quick patch?

  • CHAOS VS OUTAGE

    Chaos

    Controlled

    Planned

    Intentional

    Microscopic user impact

    Outages

    Uncontrolled

    Unpredictable

    Unintended

    Large impact

  • Single Point of Failure

    Discover - Fix - Validate

  • CHAOS MONKEY

    http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

    https://github.com/Netflix/SimianArmy

  • 9am-5pm Mon-Fri: don't upset your on-call

    1 instance: per group / per day

    Detect SPOF: intentionally
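
The scheduling rules on this slide (business hours only, one instance per group per day) can be sketched as a small selection function. This is an illustration only, not the SimianArmy implementation: the function names, the `groups` mapping, and the instance IDs are hypothetical, and actually terminating the chosen instances is left to the caller. The "per day" part is assumed to come from running the function once per day.

```python
import random
from datetime import datetime
from typing import Dict, List, Optional


def in_business_hours(now: datetime) -> bool:
    """True Monday-Friday between 09:00 and 17:00 (in the timezone of `now`)."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def pick_victims(
    groups: Dict[str, List[str]],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> Dict[str, str]:
    """Pick at most one instance per group; pick nothing outside business hours.

    `groups` maps a group name to its instance IDs (hypothetical data shape).
    Intended to be invoked once per day by a scheduler.
    """
    if not in_business_hours(now):
        return {}
    rng = rng or random.Random()
    return {
        name: rng.choice(instances)
        for name, instances in groups.items()
        if instances  # skip empty groups
    }
```

Restricting terminations to working hours means the engineers who own the service are at their desks when a single point of failure is exposed, rather than being paged overnight.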

  • Slow is Hard

    Product + Business + Engineering Decisions

    https://pragprog.com/book/mnee/release-it

  • Custom Fallback: accuracy or latency

    Fail Silent: for optional data

    Fail Fast: to keep servers healthy
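
A minimal sketch of the last two options, under assumptions: "fail silent" is taken to mean swallowing the error from an optional dependency and returning a default, and "fail fast" is taken to mean rejecting work immediately once too many requests are already in flight. The names `fail_silent`, `FailFastGate`, and `Overloaded` are hypothetical and do not come from any library named in the talk.

```python
import threading
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def fail_silent(call: Callable[[], T], default: Optional[T] = None) -> Optional[T]:
    """Run an optional call; on any error return `default` instead of raising."""
    try:
        return call()
    except Exception:
        return default


class Overloaded(Exception):
    """Raised when a request is rejected without being attempted."""


class FailFastGate:
    """Reject new work immediately once `max_in_flight` calls are running."""

    def __init__(self, max_in_flight: int) -> None:
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def run(self, call: Callable[[], T]) -> T:
        with self._lock:
            if self._in_flight >= self._max:
                # Fail fast: do not queue, do not wait.
                raise Overloaded("too many requests in flight")
            self._in_flight += 1
        try:
            return call()
        finally:
            with self._lock:
                self._in_flight -= 1
```

For example, `fail_silent(fetch_recommendations, default=[])` (with a hypothetical `fetch_recommendations`) would let a page render without its optional panel, while the gate keeps a slow dependency from tying up every worker on the server.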

  • LATENCY MONKEY

    Other frameworks

    http://www.infoq.com/presentations/failure-as-a-service-netflix

    http://techblog.netflix.com/2014/10/fit-failure-injection-testing.html

  • HTTP 5xx, 1 minute duration

    10-100ms sleep during request

    1-100% of requests
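
The three knobs on this slide can be sketched as a wrapper around a request handler. This is a guess at how the parameters combine, not Netflix's Latency Monkey or FIT: for a limited window (60 seconds by default) a configurable fraction of requests either receives an injected 5xx status or is delayed by 10-100 ms. The class `FaultInjector`, the `(status, body)` response shape, and all parameter names are hypothetical.

```python
import random
import time
from typing import Callable, Optional, Tuple

Response = Tuple[int, str]  # (HTTP status, body); hypothetical shape


class FaultInjector:
    """Wrap a handler and inject delay or errors into a fraction of requests."""

    def __init__(
        self,
        handler: Callable[[str], Response],
        fraction: float,                     # 0.01-1.0, i.e. 1-100% of requests
        min_delay_s: float = 0.010,          # 10 ms
        max_delay_s: float = 0.100,          # 100 ms
        error_status: Optional[int] = None,  # e.g. 503 to inject HTTP 5xx
        duration_s: float = 60.0,            # experiment lasts one minute
        rng: Optional[random.Random] = None,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError("fraction must be between 0 and 1")
        self._handler = handler
        self._fraction = fraction
        self._min_delay_s = min_delay_s
        self._max_delay_s = max_delay_s
        self._error_status = error_status
        self._rng = rng or random.Random()
        self._clock = clock
        self._sleep = sleep
        self._ends_at = clock() + duration_s

    def __call__(self, request: str) -> Response:
        active = self._clock() < self._ends_at
        if active and self._rng.random() < self._fraction:
            if self._error_status is not None:
                return self._error_status, "injected failure"
            self._sleep(self._rng.uniform(self._min_delay_s, self._max_delay_s))
        return self._handler(request)
```

Bounding the experiment in time and in the share of affected requests is what keeps the user impact small; once the window ends the wrapper passes every request straight through.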

  • Prevent Propagation

    To avoid cascading failure
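
The slide does not name a mechanism for stopping failures from propagating. One commonly used option is a circuit breaker, a pattern described in Release It!, which the deck links earlier. The sketch below is an illustration under that assumption; `CircuitBreaker`, `CircuitOpen`, and the threshold values are hypothetical and not taken from the talk.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpen(Exception):
    """Raised when a call is refused because the dependency looks unhealthy."""


class CircuitBreaker:
    """Stop calling a failing dependency for a while instead of piling on."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_after_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._max_failures = max_failures
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after_s:
                raise CircuitOpen("dependency marked unhealthy")
            # Cool-down elapsed: allow one trial call through (half-open).
            self._opened_at = None
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        self._failures = 0  # a success closes the circuit fully
        return result
```

While the circuit is open, callers get an immediate `CircuitOpen` and can apply a fallback, so a struggling downstream service is not hit with retries and upstream threads are not held waiting on it.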

  • CHAOS KONG

    Because regions fail

    http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html

  • GeoDNS: fallback to LatencyDNS

    Proxy: cross-region communication

    Capacity: cost-benefit decision

  • "ONCE IN A BLUE MOON"

    Happens at least a few times a year...

    https://whisperofangels.wordpress.com/2013/08/20/once-in-a-blue-moon/

  • TAKE AWAY

    Go found chaos engineering at your company RIGHT NOW

  • Most enterprises hire people to fix things. Netflix hires people to break things.

    we should embrace Netflix's culture of "chaos engineering" throughout organizations of all shapes and sizes.

    http://readwrite.com/2014/09/17/netflix-chaos-engineering-for-everyone

  • Q & A

    @bruce_m_wong / @jiboumans

    Slides - https://www.linkedin.com/in/brucemwong

    http://vickicaruana.blogspot.com/2011/01/are-you-afraid-to-raise-your-hand.html