madaari monkeys ordering for the - usenix · joint work between ebay and disorderly labs dr. peter...
Madaari : Ordering For The Monkeys
Agenda
● Distributed Systems and Chaos Engineering : State Of The Union
● Lineage Driven Fault Injection : A Brief Primer
● LDFI : Ordering Of Faults
● Bringing LDFI to the Enterprise
● Results
● Future Work
Industry + Academia = Win !!
Joint work between eBay and Disorderly Labs
● Dr. Peter Alvaro ( UCSC )
● Kamala Ramasubramanian ( UCSC )
● eBay SRE Team
Madaari : a trainer who teaches a monkey to perform tricks
The Problem : Testing Distributed Systems
Combinatorial Space of Failures
Microservices Death Star
Consider 100 Services
Fault Search Space : 2^100

| Fault Cardinality | Possible Faults |
|---|---|
| 1 | 100 |
| 4 | ~3.9 Million |
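As a quick check on these numbers, the short Python sketch below counts the fault combinations for 100 services. It is illustrative only; the function and variable names are not from the talk.

```python
from math import comb

SERVICES = 100

def possible_faults(cardinality, services=SERVICES):
    """Number of distinct sets of `cardinality` services that could fail together."""
    return comb(services, cardinality)

# Every subset of the 100 services is a potential fault scenario.
total = 2 ** SERVICES

print(possible_faults(1))   # 100
print(possible_faults(4))   # 3921225
print(f"{total:.2e}")       # full search space, roughly 1.27e30
```

Even at cardinality 4 the space is already in the millions, which is why exhaustive fault injection is not an option.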
Chaos Engineering : A Possible Solution
● Failure is inevitable, let’s fail in a controlled environment
● Proactively inject failure in your system to reveal weaknesses
● Perturbation and observation of large-scale systems
Chaos Engineering : A Brief Primer
● Guided Fault Injection
○ A genius holds the mental model of the system
○ Doesn’t scale well !!
● Random Fault Injection
○ No Model Of The System
○ Can’t quantify progress
Lineage Driven Fault Injection aka LDFI
CLAIM : Fault Tolerance = Redundancy
● Use explanations of successful outcomes to search for faults that can drive the system into a bad state
● Observing successful executions enables LDFI to build a model of the redundancy of the system
LDFI : Building Blocks
Recipe:
1. Start with a successful outcome. Work backwards.
2. Ask why it happened. Ans. Lineage (Traces)
3. Convert lineage to a CNF formula and solve the decision problem ( using a SAT solver )
4. Lather, rinse, repeat
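The recipe can be sketched as a small loop. The Python below is a toy illustration, not the tool described in the talk: the two-service system (`ALTERNATIVES`), the function names, and the brute-force search that stands in for a SAT solver are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy system: the request succeeds if any one of these
# alternative sets of services is fully healthy (tried in order).
ALTERNATIVES = [frozenset({"cache"}), frozenset({"db"})]

def run_with_faults(faults):
    """Replay the request with `faults` injected; return its lineage, or None on failure."""
    for path in ALTERNATIVES:
        if not (path & faults):
            return path
    return None

def candidate_fault_sets(lineages):
    """Yield, smallest first, the fault sets that intersect every observed lineage."""
    services = sorted(set().union(*lineages))
    for size in range(1, len(services) + 1):
        for combo in combinations(services, size):
            if all(set(combo) & lineage for lineage in lineages):
                yield frozenset(combo)

def ldfi():
    baseline = run_with_faults(frozenset())      # step 1: a successful outcome
    assert baseline is not None, "the fault-free run must succeed"
    lineages = [baseline]                        # step 2: why did it succeed?
    tried = set()
    while True:                                  # step 4: lather, rinse, repeat
        # step 3: pick an untested fault set that would break every known explanation
        hypothesis = next(
            (c for c in candidate_fault_sets(lineages) if c not in tried), None
        )
        if hypothesis is None:
            return None, lineages                # nothing left to try: no bug found
        tried.add(hypothesis)
        lineage = run_with_faults(hypothesis)
        if lineage is None:
            return hypothesis, lineages          # this fault set breaks the outcome
        lineages.append(lineage)                 # new redundancy discovered

fault, seen = ldfi()
print(sorted(fault), [sorted(l) for l in seen])
```

In this toy run the first injection (failing `cache`) is survived through `db`, so the second round proposes failing both, which breaks the request.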
Encoding the Lineage
[Figure: first call graph, nodes A, B, C, D, E]
(A ∨ B ∨ C ∨ D ∨ E)
[Figure: second call graph, nodes A, C, D, E, B]
(A ∨ C ∨ D ∨ E)
Combined formula : (A ∨ B ∨ C ∨ D ∨ E) ∧ (A ∨ C ∨ D ∨ E)
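To make the encoding concrete, here is a hedged Python sketch that turns the two traces into integer clauses, the list-of-lists form that SAT solver bindings commonly accept. A variable set to true means "this service is failed". The helper names are hypothetical, and the brute-force enumerator is only a stand-in for a real solver.

```python
from itertools import product

def encode(traces):
    """Map each service to a positive integer variable and each trace to one clause."""
    var_of = {}
    clauses = []
    for trace in traces:
        clause = []
        for service in trace:
            # Reuse the variable if the service was already seen in an earlier trace.
            clause.append(var_of.setdefault(service, len(var_of) + 1))
        clauses.append(clause)
    return var_of, clauses

def satisfying_fault_sets(var_of, clauses):
    """Brute-force stand-in for a SAT solver: yield every fault set satisfying all clauses."""
    names = list(var_of)
    for bits in product([False, True], repeat=len(names)):
        failed_names = {n for n, b in zip(names, bits) if b}
        failed_vars = {var_of[n] for n in failed_names}
        # Each clause needs at least one failed service from that trace.
        if all(any(v in failed_vars for v in clause) for clause in clauses):
            yield failed_names

traces = [["A", "B", "C", "D", "E"], ["A", "C", "D", "E"]]
var_of, clauses = encode(traces)
solutions = list(satisfying_fault_sets(var_of, clauses))
print(clauses)          # [[1, 2, 3, 4, 5], [1, 3, 4, 5]]
print(len(solutions))   # 30
```

Failing only B does not satisfy the formula, because the second trace shows the request succeeding without B.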
Ordering For Faults : Injecting Faults That Matter
● Drawbacks of existing approach
○ LDFI (using SAT) reduces the search space, but the search space might still be large
○ LDFI is a decision problem; solutions are returned in no particular order
● We want to order solutions to:
○ Find the most likely faults before users do!
○ Reduce the search space as much as possible
Ordering Faults : Injecting Faults That Matter
LDFI assumes all faults are equally likely; the reality differs !!
Intuition : Some faults are more likely than others; incident history usually backs this claim
We want to encode our intuition of failure in LDFI
[Figure: call graph with nodes A, B, C, D, E, F]
Ordering Faults : Injecting Faults That Matter
Use the structure of the Trace to prune the Solution Space :
1. Rank Of the Service ( distance from the root )
2. Size Of the sub graph of the Service
3. If we survive the failure of C, we will surely survive the failure of D, E and F
[Figure: two call graphs, one with nodes A, B and one with nodes A, B, C, D, E, F]
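The three structural heuristics can be sketched in Python as follows. The edges in `CALLS` are an assumption (the slide only implies that D, E and F sit below C), and so is the choice to try services closer to the root, and with larger subgraphs, first.

```python
from collections import deque

# Hypothetical call graph: A calls B and C, C calls D and E, E calls F.
CALLS = {"A": ["B", "C"], "B": [], "C": ["D", "E"], "D": [], "E": ["F"], "F": []}

def ranks(graph, root):
    """Rank of each service: its distance from the root (breadth-first search)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in graph[node]:
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth

def descendants(graph, node):
    """All services in the subgraph below `node`."""
    seen = set()
    stack = list(graph[node])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph[current])
    return seen

def ordered_candidates(graph, root):
    """Order non-root services by rank, then by larger subgraph, then by name."""
    depth = ranks(graph, root)
    return sorted(
        (n for n in graph if n != root),
        key=lambda n: (depth[n], -len(descendants(graph, n)), n),
    )

def prune_after_survival(candidates, graph, survived):
    """If the system survived `survived` failing, skip it and everything below it."""
    skip = descendants(graph, survived) | {survived}
    return [c for c in candidates if c not in skip]

order = ordered_candidates(CALLS, "A")
print(order)                                     # ['C', 'B', 'E', 'D', 'F']
print(prune_after_survival(order, CALLS, "C"))   # ['B']
```

Surviving the failure of C removes four of the five candidates in one experiment, which is where the pruning pays off.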
Ordering Faults : Injecting Faults That Matter
● All services are not created equal, some services fail more than others
● Likelihood and Containment :
○ P(Node failure) > P(Rack Failure) >> P(Data center failure)
● Historical measures :
○ Time since last release
○ History Of Failure and Bug Rate
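One simple way to use such likelihoods is to rank candidate fault sets by their joint probability. The Python sketch below assumes independent failures, and the per-service probabilities are made up for illustration; in practice they might be estimated from incident history or release recency.

```python
from math import prod

def rank_by_likelihood(fault_sets, p_fail):
    """Sort fault sets from most to least likely, assuming independent failures."""
    return sorted(
        fault_sets,
        key=lambda fault_set: prod(p_fail[s] for s in fault_set),
        reverse=True,
    )

# Hypothetical per-service failure probabilities (not from the talk).
p_fail = {"A": 0.01, "C": 0.10, "D": 0.02, "E": 0.05}

candidates = [
    frozenset({"A"}),
    frozenset({"C"}),
    frozenset({"D"}),
    frozenset({"E"}),
    frozenset({"D", "E"}),
]

ranked = rank_by_likelihood(candidates, p_fail)
print([sorted(fs) for fs in ranked])
# [['C'], ['E'], ['D'], ['A'], ['D', 'E']]
```

The double fault lands last because its joint probability is the product of two small numbers.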
LDFI in the Enterprise
● Explanations
● Models Of Redundancy
● Fault Injection
Traces = Explanations
● Distributed Tracing
○ Call graphs come for free
● Less Ideal (but OK) : Structured Logging
○ We did this too !!
● What are traces anyway ?
○ Ordered Events with context stitched together
○ Create the call graphs using service names and endpoints
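A minimal Python sketch of building a call graph from trace spans follows. The span fields (`span_id`, `parent_id`, `service`, `endpoint`) and the sample data are assumptions for the example; real tracing systems use their own schemas.

```python
def call_graph(spans):
    """Build {caller: [callees]} where each node is named 'service:endpoint'."""
    by_id = {span["span_id"]: span for span in spans}

    def node_name(span):
        return f'{span["service"]}:{span["endpoint"]}'

    graph = {}
    for span in spans:
        node = node_name(span)
        graph.setdefault(node, set())            # make sure leaf nodes appear too
        parent = by_id.get(span.get("parent_id"))
        if parent is not None:
            graph.setdefault(node_name(parent), set()).add(node)
    return {caller: sorted(callees) for caller, callees in graph.items()}

# Hypothetical spans from one successful checkout request.
spans = [
    {"span_id": "1", "parent_id": None, "service": "web", "endpoint": "/checkout"},
    {"span_id": "2", "parent_id": "1", "service": "cart", "endpoint": "/items"},
    {"span_id": "3", "parent_id": "1", "service": "payment", "endpoint": "/charge"},
    {"span_id": "4", "parent_id": "3", "service": "ledger", "endpoint": "/write"},
]

graph = call_graph(spans)
for caller, callees in graph.items():
    print(caller, "->", callees)
```

With structured logs instead of spans, the same idea applies as long as each log record carries enough context to identify its caller.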
Fault Injection Tool
● We rolled our own ( Mowgli )
○ Circuit breaker aware fault injection tool, deals with services and databases
○ Built in safety mechanisms
○ Hooks for AZ level, node level fault injection
○ Audit and Tracking capabilities
● Lots of open source options available
○ Start simple, a script to drop network traffic is also OK
○ https://github.com/dastergon/awesome-chaos-engineering
● Tip : Be safe by default
○ Always have a rollback strategy
Interaction Replay
● Ability to replay interactions ( Tip : E2E Tests )
● Measure of Success
○ A single binary ( yes or no ) way of saying whether the execution was successful or not
● Works for Eventually Consistent systems as well, as long as there is a finite upper bound on the eventuality
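For an eventually consistent system, the binary success measure can be sketched as a poll with a deadline, where the deadline plays the role of the finite upper bound. This Python helper is a hypothetical illustration; the name `eventually` and its parameters are not from the talk.

```python
import time

def eventually(check, timeout_s, interval_s=0.1, clock=time.monotonic, sleep=time.sleep):
    """Return True if `check()` becomes true within `timeout_s` seconds, else False."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True          # success observed within the bound
        if clock() >= deadline:
            return False         # bound exceeded: count the execution as failed
        sleep(interval_s)

# Example: a check that only passes on its third evaluation.
attempts = {"count": 0}

def order_visible():
    attempts["count"] += 1
    return attempts["count"] >= 3

print(eventually(order_visible, timeout_s=5, interval_s=0.01))
```

Passing `clock` and `sleep` as parameters keeps the helper easy to test without real waiting.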
LDFI in the Enterprise
[Diagram: Traces/Structured Logs → ( To Call Graphs ) → LDFI → ( Fault Suggestion ) → FIT Tool]
Encode For The Solver
● PyCoSAT
● PULP
● SAT4J
Results : Finding Bugs
Comparison With Chaos Monkey

| Strategy | Fault Experiment Runs (avg.) | Standard Deviation |
|---|---|---|
| Ordered LDFI | 17 | 0 |
| Uniform Random | 210.35 | 111.42 |

How long did it take to find those 5 bugs? A few hours
(An experiment takes ~2 minutes, and we did retries to get around our infrastructure)
Results : Finding No Bugs
Madaari : The Road Ahead
● Scalarizing Probabilities of Failure
● SLA verification using strategic Delay Injection
● Fine Grained Fault Injection
● Reason about Stateful systems
● Microservices Only ?
○ Databases, Containers, Service Mesh .. Let’s Go !!
LDFI : The Road Ahead
3 W’s For Fault Injection
1. What to inject ? ( type of fault we want to inject )
2. Where to inject ? (the target component )
3. When to inject ? ( inject when there are exactly 5 items in the cart !! )
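One way to picture the three W's is as a single fault specification with a state predicate for the "when". The Python sketch below is purely hypothetical: the `FaultSpec` class, its field names, and the `cart-service` target are illustrative and do not describe an existing tool.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

@dataclass(frozen=True)
class FaultSpec:
    what: str                                   # type of fault, e.g. "error" or "latency"
    where: str                                  # the target component
    when: Callable[[Mapping[str, Any]], bool]   # predicate over application state

    def should_inject(self, component: str, state: Mapping[str, Any]) -> bool:
        """Inject only at the target component and only in the chosen state."""
        return component == self.where and self.when(state)

# "Inject when there are exactly 5 items in the cart."
spec = FaultSpec(
    what="error",
    where="cart-service",
    when=lambda state: state.get("cart_items") == 5,
)

print(spec.should_inject("cart-service", {"cart_items": 5}))   # True
print(spec.should_inject("cart-service", {"cart_items": 4}))   # False
```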
LDFI : The Road Ahead
A Journey from Time to State and back
1. What’s time anyway ??
2. Applications have state and change of state gives you implicit order.
3. A rendezvous of state and time gives us precision for fault injection.
Madaari : Key Takeaways
● Industry and Academia can work together for fun(d) and profit
● Limitations of LDFI w.r.t. unordered solutions and why ordering matters for chaos engineering experiments
● Understand how LDFI can be integrated in the enterprise by harnessing the observability infrastructure
● Preliminary results of prioritized LDFI and a future direction for the community
● Evangelising new techniques is hard; start small and stay simple
Discussion