Illusions of Certainty
What the brain can teach us about software engineering
Julie Pitt
Co-founder, Order of Magnitude Labs
@yakticus
relevant links found here:
github.com/yakticus/IllusionsOfCertainty
“For the things we have to learn before we can do them, we learn by doing
them.”
― Aristotle, The Nicomachean Ethics
today we will discuss
➔ a BIG reason why software projects are unpredictable
➔ how to help computers better understand what we mean
➔ how to make our software systems more resilient
➔ how to better understand what our software systems are doing
life
life: a generative model with an interface to the world
the world
senses
action
generative model
survival
working
not working
software as a generative model
the world
input
output
the code
infinite precision
misjudging uncertainty in software
reality
perception
human precision
don’t kill humans
be nice to people
don’t hurt people
keep humans alive
respect human life
don’t kill humans
machine precision
don’t kill humans
be nice to people
don’t hurt people
keep humans alive
respect human life
the cliffs of infinite precision
the happy path
utterly broken
how do we get to this?
optimal
degraded
resilience
ways we can cheat
➔ property tests
➔ remedy-first design
➔ build intuitive insight
property tests
test suite as a generative model
system under test
y
x
test suite
individual test cases are often too precise
software system state space
desired behavior
tests (“training examples”)
testing an addition function: F# example
state space
credit: http://fsharpforfunandprofit.com/posts/property-based-testing/
✅ test passes
overfitting to tests
state space
bug
property tests combat overfitting
state space
bug
property tests: let’s review
- test suites are generative models
- describe the properties of your system
- requires less precision
- test the properties
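A minimal sketch of the idea for the addition example, in Python rather than the F# of the credited post. It uses a hand-rolled random checker instead of a real property-testing library, and every name here (`find_counterexample`, `overfitted_add`, the three properties) is an illustrative assumption, not something from the talk.

```python
import random

def add(x, y):
    """System under test: the addition function."""
    return x + y

def overfitted_add(x, y):
    """Passes the single example test add(1, 2) == 3, but is wrong elsewhere."""
    return 3

# Properties describe behavior over the whole state space, not single points.
def commutative(f, x, y):
    return f(x, y) == f(y, x)

def zero_is_identity(f, x, y):
    return f(x, 0) == x

def add_one_twice_equals_add_two(f, x, y):
    return f(f(x, 1), 1) == f(x, 2)

PROPERTIES = [commutative, zero_is_identity, add_one_twice_equals_add_two]

def find_counterexample(f, prop, trials=200, seed=0):
    """Try `prop` on random integer pairs; return the first failing pair, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y = rng.randint(-1000, 1000), rng.randint(-1000, 1000)
        if not prop(f, x, y):
            return (x, y)
    return None

if __name__ == "__main__":
    for prop in PROPERTIES:
        print(prop.__name__,
              find_counterexample(add, prop),
              find_counterexample(overfitted_add, prop))
```

The example test `overfitted_add(1, 2) == 3` passes, yet the identity property finds a counterexample, which is the overfitting the slides describe.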
remedy-first design
GET /api/metadata/12345
{"status": "failure", "error": {"errorCode": 234, "description": "database timeout"}}
input
output
RESTful service
client falls off cliff
each error has a precise cause
read timeout
connect timeout
connection pool exhausted
token expired
credentials revoked
key rotation
endpoint moved
endpoints expired
failover
user error
insufficient permissions
account problem
remedies are imprecise
read timeout
connect timeout
connection pool exhausted
token expired
credentials revoked
key rotation
endpoint moved
endpoints expired
failover
user error
insufficient permissions
account problem
RETRY
REDIRECT
DISPLAY ERROR
RE-AUTHENTICATE
remedy tells the client how to ease pain
{"status": "failure", "failure": {"action": "RETRY", "error": {"errorCode": 234, "description": "database timeout"}}}
remedy (actionable)
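A rough sketch of how a service and client might use the remedy field. The grouping of causes under each remedy is an assumption (the transcript lists causes and remedies but not which maps to which), and `REMEDY_BY_CAUSE`, `failure_body` and `choose_action` are hypothetical names.

```python
import json

# Assumed grouping of the precise causes under the four remedies on the slide.
REMEDY_BY_CAUSE = {
    "read timeout": "RETRY",
    "connect timeout": "RETRY",
    "connection pool exhausted": "RETRY",
    "token expired": "RE-AUTHENTICATE",
    "credentials revoked": "RE-AUTHENTICATE",
    "key rotation": "RE-AUTHENTICATE",
    "endpoint moved": "REDIRECT",
    "endpoints expired": "REDIRECT",
    "failover": "REDIRECT",
    "user error": "DISPLAY ERROR",
    "insufficient permissions": "DISPLAY ERROR",
    "account problem": "DISPLAY ERROR",
}

def failure_body(cause, error_code):
    """Service side: attach an imprecise remedy to a precise error."""
    return json.dumps({
        "status": "failure",
        "failure": {
            # Unknown causes fall back to the safest remedy (an assumption).
            "action": REMEDY_BY_CAUSE.get(cause, "DISPLAY ERROR"),
            "error": {"errorCode": error_code, "description": cause},
        },
    })

def choose_action(body):
    """Client side: branch on the remedy, never on the error code."""
    payload = json.loads(body)
    if payload.get("status") != "failure":
        return "NONE"
    return payload["failure"].get("action", "DISPLAY ERROR")

if __name__ == "__main__":
    print(choose_action(failure_body("read timeout", 234)))
```

The client only needs to understand four actions, so a new cause on the service side does not push it off the cliff.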
What about failures that weren’t predicted?
AWS outage - 2012/10/22
-> DNS change didn’t propagate
-> indirectly triggered a latent memory leak
-> insufficient alerting; failovers happened too little, too late
-> API throttling affected some customers more than others
-> many popular internet services down for hours
failure comes in many forms
AWS scheduled maintenance - 2014/09/25
-> time-sensitive security update on 10% of EC2 nodes
-> required reboot of those nodes
-> possible impact to customer applications running on those nodes
failure comes in many forms
AWS DynamoDB outage - 2015/09/20
-> DynamoDB failed in us-east-1 region
-> dozens of dependent services also failed
-> many prominent internet services were taken down for hours
failure comes in many forms
Netflix was prepared
meet simian army
- OSS project by Netflix
- deliberately causes failures in a controlled manner
- e.g., randomly takes down AWS EC2 nodes, a datacenter, or a region
- validates whether the system handles failure
simian army -> cultural change
- failure is the norm
- simulates the nature of failure and not the cause
- we can’t predict all causes of failure
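A toy sketch of cause-agnostic failure injection, to make the idea concrete. This is not the Simian Army API; `Node`, `call_with_failover` and `unleash_monkey` are hypothetical names, and the "nodes" are in-process objects rather than real EC2 instances.

```python
import random

class Node:
    """Stand-in for a server instance."""
    def __init__(self, name):
        self.name = name
        self.alive = True

    def handle(self, request):
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"

def call_with_failover(nodes, request):
    """The remedy under test: try each node until one answers."""
    for node in nodes:
        try:
            return node.handle(request)
        except ConnectionError:
            continue
    raise RuntimeError("all nodes down")

def unleash_monkey(nodes, rng, kill_count=1):
    """Take down random nodes. It simulates the nature of failure, not a cause."""
    victims = rng.sample(nodes, kill_count)
    for victim in victims:
        victim.alive = False
    return victims

if __name__ == "__main__":
    cluster = [Node("a"), Node("b"), Node("c")]
    unleash_monkey(cluster, random.Random(), kill_count=2)
    # The check does not care why nodes died, only that the remedy works.
    print(call_with_failover(cluster, "GET /api/metadata/12345"))
```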
remedy-first design: let’s review
- design with remedies in mind
- # remedies << # causes
- test resilience during business hours
- find out what you’re up against when wide awake
- use a tool that is agnostic to causes
- e.g., simian army
intuitive feedback
is it working?
logs: easy to produce
logs: hard to consume
charts
charts: easier to consume, but still hard
we want the whole picture
solution: leverage our intuition
thought experiment
What if your software system’s interactions sounded like cars on the road?
intuitive feedback: let’s review
- humans want to know: “is it working?”
- the tools of today inhibit us from seeing the big picture
- we need tools that leverage our intuition
- e.g., vizceral & TBD
conclusion
curiosity-driven tests
system under test
senses
action
test agent (neural network)
mapping the state space through exploration
state space
begin testing random states without expectations
mapping the state space through exploration
state space
gradually build a model containing expectations
mapping the state space through exploration
state space
model capable of recognizing anomalies
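The three exploration steps above can be sketched very roughly as follows. The slides propose a neural-network test agent; this sketch swaps in a plain mean-and-spread model to stay short, and `system_under_test`, `explore`, `build_model` and `is_anomaly` are all hypothetical names.

```python
import random
import statistics

def system_under_test(load):
    """Hypothetical system: response time in ms grows gently with load."""
    return 10.0 + 0.1 * load

def explore(system, rng, trials=500):
    """Step 1: probe random states without expectations and record the results."""
    return [system(rng.uniform(0, 100)) for _ in range(trials)]

def build_model(observations):
    """Step 2: turn observations into expectations (here, mean and spread)."""
    return statistics.mean(observations), statistics.stdev(observations)

def is_anomaly(model, observation, threshold=4.0):
    """Step 3: flag observations far outside what exploration has seen."""
    mean, spread = model
    return abs(observation - mean) > threshold * spread

if __name__ == "__main__":
    model = build_model(explore(system_under_test, random.Random(0)))
    print(is_anomaly(model, system_under_test(50)))  # an ordinary observation
    print(is_anomaly(model, 500.0))                  # a response far out of range
```

No expected values are written by hand here: the expectations come from the exploration itself.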
self-healing systems
telemetry
senses
action
ops agent (neural network)
deployment, scaling, failover, etc.
working
not working
let’s review
goal: change the landscape
the end.
links
github.com/yakticus/IllusionsOfCertainty