FAILURE The Good Parts

Viktor Klang Director of Engineering

“Build powerful, concurrent, resilient & distributed software more easily.”

FAILURE The Bad Parts

Ariane 5 - 4 June 1996

๏ 10 years of research

๏ $7 billion invested

๏ Exploded within a minute of take-off

๏ Loss estimate $370 million

๏ Why?

๏ Trying to stuff a 64-bit float into a 16-bit int

๏ o_O + wat

“Failure is an option. A Some(failure) to be exact.” – me

Failure Recovery

#define Failure

#undef Failure

Software fails

Runtime

๏VM (OpenJDK Issue Tracker)

๏Drivers

๏Firmware

Runtime

๏Overload/Exhaustion

๏Stack

๏Heap

๏FDs

๏…

๏Starvation

Hardware fails

"Related instructions that are affected by the bug are

FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, and FPREM1.

The instructions FPTAN and FPATAN are also susceptible"

http://en.wikipedia.org/wiki/Pentium_FDIV_bug

DRAM Errors in the Wild: A Large-Scale Field Study

Bianca Schroeder Dept. of Computer Science

University of Toronto Toronto, Canada

bianca@cs.toronto.edu

Eduardo Pinheiro Google Inc.

Mountain View, CA

Wolf-Dietrich Weber Google Inc.

Mountain View, CA

DRAM Errors in the wild

๏Memory errors were 15 to 120 times (!) more common than had previously been assumed.

๏More than 90% of the problems with a given platform were caused by about 20% of the machines that had errors.

DRAM Errors in the wild

(Credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber)

http://news.cnet.com/8301-30685_3-10370026-264.html

DRAM Errors in the wild

๏Temperature didn't seem to make a big difference.

๏Irreparable problems were more common than transient problems.

๏Increased number of errors with age, setting in as early as 10-18 months in the field.

Failure Trends in a Large Disk Drive Population

Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso

Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043

{edpin,wolf,luiz}@google.com

Failure Trends by age

Failure Trends by utilization and age

The Network is Reliable LOL

Kyle Kingsbury's blog:

http://aphyr.com/posts/288-the-network-is-reliable

Wetware fails

“An expert is a man who has made all the mistakes which can be made, in a narrow field.” – Niels Bohr

Assumptions are bad

val result = something(x,y)

Validation vs Failure

๏ Failure is unintentional

๏ Validation is intentional

Flows of information

๏ Results & Validation

๏ Failures & Recovery

๏ Don't complect them!
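
One way to keep the two flows apart, as a minimal sketch: validation outcomes travel as ordinary return values (Either), while failures travel on their own channel (Try). The Order type, the validate and process functions, and the store callback are hypothetical names invented for this illustration, not taken from the talk.

```scala
import scala.util.Try

object ValidationVsFailure {
  // Hypothetical domain type, for illustration only.
  final case class Order(item: String, quantity: Int)

  // Validation is intentional: a rejected request is an expected outcome,
  // so it is returned to the caller as a plain value.
  def validate(item: String, quantity: Int): Either[String, Order] =
    if (item.isEmpty) Left("item must not be empty")
    else if (quantity <= 0) Left("quantity must be positive")
    else Right(Order(item, quantity))

  // Failure is unintentional: the store may break for reasons the caller
  // cannot fix. Try captures non-fatal exceptions on a separate channel.
  def process(order: Order, store: Order => Unit): Try[Unit] =
    Try(store(order))
}
```

A caller can pattern match on the Either to answer the user, and hand a Failure from process to whoever is responsible for recovery.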

The Little Vending Machine That Could

(Diagram labels: Failure, Validation, Handled)

Outcome awareness

Known-Unknowns Unknown-Unknowns

Known-Knowns Unknown-Knowns

Failure awareness

Known-Unknowns Unknown-Unknowns

Known-Knowns Unknown-Knowns

Possibilities

๏ Result

๏ Invalid input

๏ Illegal value

๏ Illegal value combination

๏ Capability/Dependency violation

๏ Nothing

๏ Uninvoked

๏ Response lost
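
The possibilities listed here could be modelled as a closed set of outcomes, so the compiler checks that a caller handles each one. This is only a sketch: every type name is an assumption made for illustration, the list is kept flat because the slide's nesting did not survive, and "Nothing" is called NoReply because scala.Nothing is already taken.

```scala
object Possibilities {
  // Everything a caller might observe after invoking an operation.
  sealed trait Outcome[+A]
  final case class Result[+A](value: A) extends Outcome[A]
  final case class InvalidInput(reason: String) extends Outcome[Nothing]
  final case class IllegalValue(name: String) extends Outcome[Nothing]
  final case class IllegalCombination(names: List[String]) extends Outcome[Nothing]
  final case class CapabilityViolation(dependency: String) extends Outcome[Nothing]
  // No reply arrived. From the outside, "uninvoked" and "response lost"
  // look the same, so they share one case here.
  case object NoReply extends Outcome[Nothing]

  // Exhaustive match: adding a new Outcome triggers a compiler warning here.
  def describe(outcome: Outcome[Any]): String = outcome match {
    case Result(value)             => "result: " + value
    case InvalidInput(reason)      => "invalid input: " + reason
    case IllegalValue(name)        => "illegal value for " + name
    case IllegalCombination(names) => "illegal combination of " + names.mkString(", ")
    case CapabilityViolation(dep)  => "dependency unavailable: " + dep
    case NoReply                   => "no reply: uninvoked or response lost"
  }
}
```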

“Program testing can be used to show the presence of bugs, but never to show their absence!” – Edsger Dijkstra

Testing & Checking

๏ Testing is good for

๏ Known-Knowns

๏ Checking is good for

๏ Unknown-Knowns

๏ Known-Unknowns

๏ Unknown-Unknowns

๏ Conclusion

๏Use both!

val result = println(x,y)

Death & Delay & Distributed Programs

๏ There is no apparent difference between death and delay in a distributed system

๏ "Distributed programming is all about retries and timeouts"

๏ Without distribution you'll always have a SPOF

๏ … but the more hardware you have, the higher the risk of failures
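
A minimal sketch of "retries and timeouts" using only the Scala standard library. The retry helper and its parameters are hypothetical names. It blocks with Await purely for brevity, and it assumes the call is safe to repeat: after a timeout the caller cannot know whether the request was lost, the response was lost, or the callee is merely slow.

```scala
import scala.annotation.tailrec
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.util.{Failure, Try}

object RetryWithTimeout {
  // Invokes `call` up to `attempts` times, waiting at most `timeout` each time.
  // A timed-out attempt is abandoned, not cancelled: it may still complete
  // on the remote side, which is why `call` should be idempotent.
  @tailrec
  def retry[A](attempts: Int, timeout: FiniteDuration)(call: () => Future[A]): Try[A] =
    Try(Await.result(call(), timeout)) match {
      case Failure(_) if attempts > 1 => retry(attempts - 1, timeout)(call)
      case other                      => other
    }
}
```

For example, `RetryWithTimeout.retry(3, 200.millis)(() => fetchQuote())` would give a hypothetical fetchQuote three chances of 200 ms each before reporting the last failure.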

“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” – Leslie Lamport

Traditional Blocking RPC

๏What if: Request is lost

๏What if: Response is lost

๏Caller is held hostage by the Callee

๏… Stockholm Syndrome anyone?

http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf

Defensive programming

๏ "Paranoid programming"

๏ Mixes concerns

๏ Unclear responsibilities

๏ At best gives a false sense of security

๏ Yields systems that fail extraordinarily

try {
  val breakfast =
    try {
      prepare(new Breakfast)
    } catch {
      case ex: OutOfJamError => …
    } finally {
      …
    }
  eat(breakfast)
} catch {
  case ex: BreakfastOverflowError => …
} finally {
  …
}

Yes We Can Make Failure Management Fun

Distribution

Replication & Failover

CircuitBreakers

๏Benefits

๏Relieves pressure on failing parts

๏Are self-healing

๏Can be operated manually
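
The three benefits above can be seen in a hand-rolled sketch. This is not Akka's circuit breaker: the class, its parameters and the injectable clock are assumptions made for illustration, and the code is not thread-safe.

```scala
import scala.util.{Failure, Success, Try}

object CircuitBreakerSketch {
  final class CircuitOpenException extends RuntimeException("circuit is open")

  // `now` is injectable so the behaviour can be checked without sleeping.
  final class CircuitBreaker(maxFailures: Int, resetAfterMillis: Long,
                             now: () => Long = () => System.currentTimeMillis()) {
    private var failures = 0
    private var openedAt: Option[Long] = None

    // Open until the reset period has elapsed. After that the next call is
    // let through as a trial, which is the self-healing part.
    def isOpen: Boolean = openedAt.exists(t => now() - t < resetAfterMillis)

    // Manual operation: an operator can force the breaker open or closed.
    def trip(): Unit = { openedAt = Some(now()) }
    def reset(): Unit = { failures = 0; openedAt = None }

    def call[A](body: => A): Try[A] = {
      if (isOpen) {
        // Fail fast without touching the struggling component.
        Failure(new CircuitOpenException)
      } else {
        Try(body) match {
          case ok @ Success(_) =>
            reset()
            ok
          case ko @ Failure(_) =>
            failures += 1
            if (failures >= maxFailures) trip()
            ko
        }
      }
    }
  }
}
```

While the breaker is open, callers get an immediate CircuitOpenException instead of queueing up behind a component that is already in trouble.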

Supervisors

๏ Components dealing with the failure of subcomponents

๏ Decouples failure from validation

๏ Makes it obvious who is responsible for what

Service

Supervisor

Result/Validation

Failures / Recovery
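
The split between the two arrows can be sketched without any actor library. The caller receives results and validation errors as an Either, while a thrown exception goes to the supervisor, which decides whether to restart the service. Directive, Restart, Stop and supervise are hypothetical names loosely modelled on the idea of a supervision strategy; this is not Akka's API.

```scala
import scala.annotation.tailrec
import scala.util.{Failure, Success, Try}

object SupervisorSketch {
  sealed trait Directive
  case object Restart extends Directive
  case object Stop extends Directive

  // Runs `service`. A returned Either (result or validation error) goes
  // straight back to the caller. A thrown non-fatal exception is a failure
  // and is handled here, according to `decide`, at most `maxRestarts` times.
  def supervise[E, A](decide: Throwable => Directive, maxRestarts: Int)(
      service: () => Either[E, A]): Option[Either[E, A]] = {
    @tailrec
    def loop(restartsLeft: Int): Option[Either[E, A]] =
      Try(service()) match {
        case Success(outcome) => Some(outcome)
        case Failure(error) =>
          decide(error) match {
            case Restart if restartsLeft > 0 => loop(restartsLeft - 1)
            case _                           => None // stopped: nothing to hand back
          }
      }
    loop(maxRestarts)
  }
}
```

The service itself contains no recovery code, which is what makes it obvious who is responsible for what.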

Supervisors

“Quis custodiet ipsos custodes?” (Who will guard the guards themselves?) – Decimus Iunius Iuvenalis

Supervision

Bulkheading

๏Compartmentalization

๏Prevent failures from cascading

๏Plays well with redundancy & failover
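
A minimal sketch of one compartment, assuming a semaphore as the wall: each dependency gets its own Bulkhead with a fixed number of permits, so a saturated dependency rejects extra work instead of dragging its neighbours down. The class and exception names are hypothetical.

```scala
import java.util.concurrent.Semaphore
import scala.util.{Failure, Try}

object BulkheadSketch {
  final class BulkheadFullException(name: String)
      extends RuntimeException("bulkhead '" + name + "' is full")

  // Allows at most `maxConcurrent` calls inside this compartment at once.
  final class Bulkhead(name: String, maxConcurrent: Int) {
    private val permits = new Semaphore(maxConcurrent)

    def run[A](body: => A): Try[A] = {
      if (!permits.tryAcquire()) {
        // Reject immediately rather than queue: the overload stays local.
        Failure(new BulkheadFullException(name))
      } else {
        try Try(body)
        finally permits.release()
      }
    }
  }
}
```

Giving each compartment its own thread pool is another common way to build the same wall; the semaphore version is shown only because it is short.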

“An escalator can never break: it can only become stairs. You should never see an Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the convenience.” – Mitch Hedberg

Graceful degradation

My crystal ball

Microservices

๏ Does one thing well

๏ Concurrent & Compartmentalized

๏ Location transparent

๏ Typed endpoints producing typed streams of data

๏ Exhibit compositionality

๏ Are async and non-blocking

๏ Support backpressure & flow control

Summary

๏Failure management

๏… is not Validation

๏… need not be boring

๏… is not optional

๏There are real consequences

๏… and there are ways to avoid them!

“Don't worry, be happy.” – Bobby McFerrin

Attribution: Steve Jurvetson

Thank you!

๏ @viktorklang on Twitter

๏ viktor.klang@typesafe.com

๏ Want to know more?

๏ http://akka.io

๏ http://typesafe.com

๏ http://reactivemanifesto.org

End of transmission…
