Illusions of Certainty

What the brain can teach us about software engineering

Julie Pitt

Co-founder, Order of Magnitude Labs

@yakticus

http://twitter.com/yakticus

relevant links found here:

http://github.com/yakticus/IllusionsOfCertainty


“For the things we have to learn before we can do them, we learn by doing them.”

― Aristotle, The Nicomachean Ethics

today we will discuss

➔ a BIG reason why software projects are unpredictable

➔ how to help computers better understand what we mean

➔ how to make our software systems more resilient

➔ how to better understand what our software systems are doing

life: a generative model with an interface to the world

the world

senses

action

generative model

survival

working

not working

software as a generative model

the world

input

output

the code

software as a generative model

the world

input

output

the code

infinite precision

misjudging uncertainty in software

reality

perception

human precision

don’t kill humans

be nice to people

don’t hurt people

keep humans alive

respect human life

don’t kill humans

machine precision

don’t kill humans

be nice to people

don’t hurt people

keep humans alive

respect human life

the cliffs of infinite precision

the happy path

utterly broken

how do we get to this?

optimal

degraded

resilience

ways we can cheat

➔ property tests

➔ remedy-first design

➔ build intuitive insight

property tests

test suite as a generative model

system under test

y

x

test suite

individual test cases are often too precise

software system state space

desired behavior

tests (“training examples”)

testing an addition function: F# example

state space

credit: http://fsharpforfunandprofit.com/posts/property-based-testing/

✅ test passes

overfitting to tests

state space

bug



property tests combat overfitting

state space

bug



property tests: let’s review

- test suites are generative models

- describe the properties of your system

- requires less precision

- test the properties
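
The review above can be sketched in code. This is a minimal illustration in Python rather than the F# of the credited post, using a hand-rolled random generator instead of a property-testing library. The names `add`, `overfit_add`, `properties`, `find_counterexample`, and `failing_properties` are assumptions made for this sketch, not part of the talk.

```python
import random

def add(x, y):
    """System under test (hypothetical): a correct addition function."""
    return x + y

def overfit_add(x, y):
    """Buggy implementation that still passes the example tests 1+2==3 and 2+2==4."""
    if (x, y) == (1, 2):
        return 3
    if (x, y) == (2, 2):
        return 4
    return 0

def properties(f):
    """Describe what addition means without naming any particular answer."""
    return {
        "commutative": lambda x, y, z: f(x, y) == f(y, x),
        "associative": lambda x, y, z: f(f(x, y), z) == f(x, f(y, z)),
        "identity": lambda x, y, z: f(x, 0) == x,
    }

def find_counterexample(prop, trials=200, seed=0):
    """Try the property on random integer triples; return the first failure or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = tuple(rng.randint(-10**6, 10**6) for _ in range(3))
        if not prop(*args):
            return args
    return None

def failing_properties(f):
    """Names of the properties for which a counterexample was found."""
    return sorted(name for name, prop in properties(f).items()
                  if find_counterexample(prop) is not None)
```

Calling `failing_properties(add)` yields an empty list, while `overfit_add`, which satisfies both hand-picked examples, should be caught by the identity property. The properties cover regions of the state space that no individual test case names.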

remedy-first design

GET /api/metadata/12345

{"status": "failure", "error": {"errorCode": 234, "description": "database timeout"}}

input

output

RESTful service

client falls off cliff

each error has a precise cause

read timeout

connect timeout

connection pool exhausted

token expired

credentials revoked

key rotation

endpoint moved

endpoints expired

failover

user error

insufficient permissions

account problem

remedies are imprecise

read timeout

connect timeout

connection pool exhausted

token expired

credentials revoked

key rotation

endpoint moved

endpoints expired

failover

user error

insufficient permissions

account problem

RETRY

REDIRECT

DISPLAY ERROR

RE-AUTHENTICATE

remedy tells the client how to ease pain

{"status": "failure", "failure": {"action": "RETRY", "error": {"errorCode": 234, "description": "database timeout"}}}

remedy (actionable)
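
A rough sketch of how a service and client might use the remedy field. The slides list the causes and the four remedies but not which cause maps to which remedy, so the grouping in `REMEDY_FOR_CAUSE` is an assumption, as are the function names `failure_response` and `next_step`.

```python
import json

# Assumed grouping: twelve precise causes collapse into four remedies.
REMEDY_FOR_CAUSE = {
    "read timeout": "RETRY",
    "connect timeout": "RETRY",
    "connection pool exhausted": "RETRY",
    "token expired": "RE-AUTHENTICATE",
    "credentials revoked": "RE-AUTHENTICATE",
    "key rotation": "RE-AUTHENTICATE",
    "endpoint moved": "REDIRECT",
    "endpoints expired": "REDIRECT",
    "failover": "REDIRECT",
    "user error": "DISPLAY ERROR",
    "insufficient permissions": "DISPLAY ERROR",
    "account problem": "DISPLAY ERROR",
}

KNOWN_REMEDIES = set(REMEDY_FOR_CAUSE.values())

def failure_response(cause, error_code, description):
    """Service side: attach the remedy so the client need not interpret the cause."""
    return json.dumps({
        "status": "failure",
        "failure": {
            "action": REMEDY_FOR_CAUSE.get(cause, "DISPLAY ERROR"),
            "error": {"errorCode": error_code, "description": description},
        },
    })

def next_step(body):
    """Client side: branch on the remedy only; unknown remedies degrade to showing the error."""
    action = json.loads(body)["failure"]["action"]
    return action if action in KNOWN_REMEDIES else "DISPLAY ERROR"
```

With this shape the client handles four cases instead of twelve, and a new cause added on the service side does not push the client off the cliff as long as it maps to an existing remedy.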

What about failures that weren’t predicted?

AWS outage - 2012/10/22

-> DNS change didn’t propagate

-> indirectly triggered a latent memory leak

-> insufficient alerting; failovers happened too little, too late

-> API throttling affected some customers more than others

-> many popular internet services down for hours

failure comes in many forms

https://aws.amazon.com/message/680342/

AWS scheduled maintenance - 2014/09/25

-> time-sensitive security update on 10% of EC2 nodes

-> required reboot of those nodes

-> possible impact to customer applications running on those nodes


https://aws.amazon.com/blogs/aws/ec2-maintenance-update/

AWS DynamoDB outage - 2015/09/20

-> DynamoDB failed in us-east-1 region

-> dozens of dependent services also failed

-> many prominent internet services were taken down for hours


https://aws.amazon.com/message/5467D2/

Netflix was prepared

meet simian army

- OSS project by Netflix

- deliberately causes failures in a controlled manner

- e.g., randomly takes down AWS EC2 nodes, a datacenter, or a region

- validates whether the system handles failure

simian army -> cultural change

- failure is the norm

- simulates the nature of failure and not the cause

- we can’t predict all causes of failure

remedy-first design: let’s review

- design with remedies in mind

- # remedies << # causes

- test resilience during business hours

- find out what you’re up against when wide awake

- use a tool that is agnostic to causes

- e.g., simian army
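
A toy illustration of cause-agnostic failure testing. This is an in-memory simulation, not the simian army API; `Fleet`, `inject_failure`, and `survives` are hypothetical names used only for this sketch.

```python
import random

class Fleet:
    """Toy stand-in for a group of redundant service instances."""

    def __init__(self, size):
        self.healthy = set(range(size))

    def handle_request(self):
        """Succeeds as long as at least one instance is still up."""
        return len(self.healthy) > 0

def inject_failure(fleet, rng):
    """Take down one random instance; the injector does not care why it failed."""
    if not fleet.healthy:
        return None
    victim = rng.choice(sorted(fleet.healthy))
    fleet.healthy.discard(victim)
    return victim

def survives(fleet_size, failures, seed=0):
    """Return True if the fleet keeps serving after `failures` injected outages."""
    rng = random.Random(seed)
    fleet = Fleet(fleet_size)
    for _ in range(failures):
        inject_failure(fleet, rng)
    return fleet.handle_request()
```

In this sketch `survives(3, 2)` holds and `survives(3, 3)` does not. The check exercises the nature of the failure (an instance disappearing) without modeling any of its possible causes.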

intuitive feedback

is it working?

logs: easy to produce

logs: hard to consume

charts

charts: easier to consume, but still hard

we want the whole picture

solution: leverage our intuition

http://www.youtube.com/watch?v=KVbTjlZ0sfE

http://www.youtube.com/watch?v=MYHf_BXWuOc

thought experiment

What if your software system’s interactions sounded like cars on the road?

intuitive feedback: let’s review

- humans want to know: “is it working?”

- the tools of today inhibit us from seeing the big picture

- we need tools that leverage our intuition

- e.g., vizceral & TBD

conclusion

curiosity-driven tests

system under test

senses

action

test agent (neural network)

mapping the state space through exploration

state space

begin testing random states without expectations


state space

gradually build a model containing expectations


state space

model capable of recognizing anomalies
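
The three stages above can be illustrated with a deliberately simple stand-in. The slides propose a neural network as the test agent; this sketch swaps in a per-bucket range table, which is a much cruder model chosen only to keep the example short. `baseline`, `explore`, and `is_anomaly` are hypothetical names.

```python
import random

def baseline(x):
    """Hypothetical healthy system: any bounded, repeatable behavior will do."""
    return x % 7

def explore(system, trials=2000, buckets=10, size=1000, seed=0):
    """Probe random states with no expectations and record what was observed.

    The model maps each bucket of the state space to the (min, max) output seen there.
    """
    rng = random.Random(seed)
    width = size // buckets
    model = {}
    for _ in range(trials):
        x = rng.randrange(size)
        y = system(x)
        low, high = model.get(x // width, (y, y))
        model[x // width] = (min(low, y), max(high, y))
    return model, width

def is_anomaly(model, width, x, y):
    """Flag outputs in unexplored regions or outside the range learned for that region."""
    if x // width not in model:
        return True
    low, high = model[x // width]
    return not (low <= y <= high)
```

After `model, width = explore(baseline)`, an observation such as `is_anomaly(model, width, 405, -1)` is flagged, because the baseline never produced a negative output anywhere in the state space.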

self-healing systems

telemetry

senses

action

ops agent (neural network)

deployment, scaling, failover, etc.

working

not working

let’s review

goal: change the landscape

the end.

links

github.com/yakticus/IllusionsOfCertainty


