antifragility and testing for distributed systems failure

Antifragility and testing distributed systems

Approaches for testing and improving resiliency

Failure

It’s inevitable

Microservice Architectures

■ Bounded contexts
■ Deterministic in nature
■ Simple behaviour
■ Independently testable (e.g. Pact)

Distributed Architectures

Conversely…

■ Unbounded context
■ Non-determinism
■ Exhibit chaotic behaviour
■ Emergent behaviour
■ Complex testing

Problems with traditional approaches

■ Integration test hell
■ Need to get by without E2E environments
■ Learnings are non-representative anyway
■ Slower
■ Costly (effort + $$)

Alternative?

Create an isolated, simulated environment

■ Run locally or on a CI environment
■ Fast - no need to set up complex test data, scenarios etc.
■ Enables single-variable hypothesis testing
■ Automatable

Lab Testing w/ Docker Compose

Hypothesis testing in simulated environments

Docker Compose

■ Docker container orchestration tool
■ Run locally or remotely
■ Works across platforms (Windows, Mac, *nix)
■ Easy to use

Let’s take a practical, real-world example: Nginx as an API Proxy.

Simulating failure with Muxy

“A tool to help simulate distributed systems failures”
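To make the idea concrete, here is a minimal sketch of what a failure-injecting layer does in principle: it sits in front of a real call, adds delay, and sometimes fails on purpose. This is not Muxy's API or configuration; the names `with_faults`, `InjectedFault`, `delay_s` and `error_rate` are hypothetical and exist only for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised when the wrapper simulates an upstream failure (hypothetical name)."""


def with_faults(
    call: Callable[[], T],
    delay_s: float = 0.0,
    error_rate: float = 0.0,
    rng: Optional[random.Random] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[[], T]:
    """Wrap `call` so each invocation is delayed and may fail on purpose."""
    rng = rng or random.Random()

    def wrapped() -> T:
        sleep(delay_s)                 # simulated network latency
        if rng.random() < error_rate:  # simulated dropped or failed request
            raise InjectedFault("simulated upstream failure")
        return call()

    return wrapped
```

Changing one parameter at a time (only `delay_s`, or only `error_rate`) keeps each experiment a single-variable test.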

Hypothesis testing

Our job is to hypothesise, test, learn, change, and repeat

Nginx Testing

H0 = Introducing network latency does not cause errors

Test setup:

● Nginx running locally, with Production configuration
● DNSMasq used to resolve production URLs to other Docker containers
● Muxy container set up, proxying the API
● A test harness to hit the API via Nginx n times, expecting 0 failures
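The test harness in the last bullet can be very small. The sketch below is one possible shape, not the harness used in the talk: `run_harness` and `http_call` are hypothetical names, and any URL passed to `http_call` is an assumption about where Nginx is listening.

```python
import urllib.request
from typing import Callable


def run_harness(call: Callable[[], object], n: int) -> int:
    """Invoke `call` n times and return how many invocations raised."""
    failures = 0
    for _ in range(n):
        try:
            call()
        except Exception:
            # Timeouts, connection errors and HTTP error statuses all count.
            failures += 1
    return failures


def http_call(url: str, timeout_s: float = 1.0) -> Callable[[], bytes]:
    """Build a callable that fetches `url`; urlopen raises on 4xx/5xx and timeouts."""
    def fetch() -> bytes:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.read()
    return fetch
```

H0 holds for a run when `run_harness(http_call(url), n)` returns 0, where `url` points at the locally running Nginx container.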

Fingers crossed...

Knobs and Levers

We now have a number of levers to pull. What if we...

● Want to improve on our SLA?
● Want to see how it performs if the API is hard down?
● ...

Antifragility

Failure is inevitable, let’s make it normal

Titanic Architectures


“Titanic architectures are architectures that are good in theory, but haven’t been put into practice”

Anti-titanic architectures?

“What doesn’t kill you makes you stronger”

Antifragility

“The resilient resists shocks and stays the same; the antifragile gets better” - Nassim Taleb

Chaos Engineering

● We expect our teams to build resilient applications
  ○ Fault tolerance across and within service boundaries
● We expect servers and dependent services to fail
● Let’s make that normal
● Production is a playground
● Levelling up

Chaos Engineering - Principles

1. Build a hypothesis around Steady State Behavior
2. Vary real-world events
3. Run experiments in production
4. Automate experiments to run continuously

Requires the ability to measure - you need metrics!!

http://www.principlesofchaos.org/
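A steady-state hypothesis only works if "steady" is something you can compute from metrics. One possible check, sketched under assumptions: the metric is a series of samples (for example, successful requests per second), and "normal" means the experiment's mean stays within a relative tolerance of the baseline mean. The function name and the 5% default are hypothetical.

```python
from statistics import mean
from typing import Sequence


def within_steady_state(
    baseline: Sequence[float],
    observed: Sequence[float],
    tolerance: float = 0.05,
) -> bool:
    """True if the observed mean is within `tolerance` (relative) of the baseline mean."""
    expected = mean(baseline)
    actual = mean(observed)
    return abs(actual - expected) <= tolerance * abs(expected)
```

A real system would likely compare against the same time window from a previous day or week rather than a flat average, since traffic is rarely constant.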

Production Hypothesis Testing

H0 = Loss of an AWS region does not result in errors

Test setup:

● Multi-region application setup for the video playing API
● Apply Chaos Kong to us-west-2
● Measure aggregate production traffic for ‘normal’ levels

Kill an AWS region

http://techblog.netflix.com/2015/09/chaos-engineering-upgraded.html

Go/Hystrix API Demo

H0 = Introducing network latency does not cause API errors

Test setup:

● API1 running with a Hystrix circuit breaker that trips if API2 does not respond within SLAs
● Muxy container set up, proxying upstream API2
● A test harness to hit API1 n times, expecting 0 failures
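For readers unfamiliar with the pattern, the sketch below shows the general circuit-breaker idea that API1 relies on. It is a simplified illustration, not Hystrix's API or its actual algorithm: it counts consecutive failures, serves a fallback while open, and allows a trial call after a cool-down. It assumes the `primary` callable raises when API2 breaches its SLA (for example, through a client timeout). All names are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Open after `max_failures` consecutive errors; allow a retry after `reset_s`."""

    def __init__(
        self,
        max_failures: int = 3,
        reset_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_failures = max_failures
        self.reset_s = reset_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # Once the cool-down has elapsed, let one trial call through (half-open).
        return self.clock() - self.opened_at < self.reset_s

    def call(self, primary: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.is_open():
            return fallback()  # short-circuit: do not touch the upstream
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

With a breaker like this in place, injected latency on API2 should surface as fallback responses from API1 rather than errors, which is what H0 predicts.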

Human Factors

Technology is only part of the problem. Can we test that too?

Chernobyl

● Worst nuclear disaster of all time (1986)
● Public information sketchy
● Estimated > 3M Ukrainians affected
● Radioactive clouds sent over Europe
● Combination of system + human errors
● Series of seemingly logical steps -> catastrophe

What we know about human factors

● Accidents happen
● 1am - 8am = higher incidence of human errors
● Humans will ignore directions
  ○ They sometimes need to (e.g. override)
  ○ Other times they think they need to (mistake)
● Computers are better at following processes

Let’s use a Production deployment as a key example:

● CI -> CD pipeline used to deploy
● Production incident occurs 6 hours later (2am)
● ...what do we do?
● We trust the build pipeline, avoid non-standard actions

These events help us understand and improve our systems

Translation

“A game day exercise is where we intentionally try to break our system, with the goal of being able to understand it better and learn from it”

Game Day Exercises

Prerequisites:

● A game plan
● All team members and affected staff aware of it
● Close collaboration between Dev, Ops, Test, Product people etc.
● An open mind
● Hypotheses
● Metrics
● Bravery

Game Day Exercises

● Get the entire team together
● Make a simple diagram of the system on a whiteboard
● Come up with ~5 failure scenarios
● Write down hypotheses for each scenario
● Back up any data you can’t lose
● Induce each failure and observe the results
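Writing hypotheses down before inducing a failure makes the surprises easy to spot afterwards. The sketch below is one hypothetical way to keep that record; the `Scenario` fields and the `surprises` helper are illustrative names, not part of any tool mentioned in this talk.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Scenario:
    failure: str                     # what we will break
    hypothesis: str                  # what we expect to happen
    observed: Optional[str] = None   # what actually happened
    held: Optional[bool] = None      # None until the scenario has been run

    def record(self, observed: str, held: bool) -> None:
        self.observed = observed
        self.held = held


def surprises(plan: List[Scenario]) -> List[Scenario]:
    """Scenarios that were run and whose hypothesis did not hold."""
    return [s for s in plan if s.held is False]
```

The surprises are the valuable output of the day: each one is a gap between how the team thinks the system behaves and how it actually does.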

Game Day Exercises

https://stripe.com/blog/game-day-exercises-at-stripe

Game Day Exercises

Examples of things that fail:

● Application dies
● Hard disk fails
● Machine dies < AZ < Region…
● Github/Source control goes down
● Build server dies
● Loss of / degraded network connectivity
● Loss of dependent API
● ...

Wrapping up

I hope I didn’t fail

■ Apply the scientific method
■ Use metrics to learn and make decisions
■ Docker-compose + Muxy to automate failure
■ Build resilience into software & architecture
■ Regularly test Production resilience until it’s normal
■ Production outages are opportunities to learn
■ Start small!

Thank you

PRESENTED BY:

@matthewfellows

References

■ Antifragility (https://en.wikipedia.org/wiki/Antifragile)
■ Chaos Engineering (http://techblog.netflix.com/2014/09/introducing-chaos-engineering.html)
■ Principles of Chaos (http://www.principlesofchaos.org/)
■ Human factors in large-scale technological systems' accidents: Three Mile Island, Bhopal, Chernobyl (http://oae.sagepub.com/content/5/2/133.abstract)

Code / Tool References

■ Docker Compose (https://www.docker.com/docker-compose)
■ Muxy (https://github.com/mefellows/muxy)
■ Nginx resilience testing with Docker Compose (www.onegeek.com.au/articles/resilience-testing-nginx-with-docker-dnsmasq-and-muxy)
■ Golang + Hystrix resilience testing with Docker Compose (https://github.com/mefellows/muxy/tree/mst-meetup-demo/examples/hystrix)
