failure the-good-parts
TRANSCRIPT
![Page 1: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/1.jpg)
FAILURE The Good Parts
Viktor Klang, Director of Engineering
![Page 2: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/2.jpg)
“Build powerful, concurrent, resilient & distributed software more easily.”
![Page 3: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/3.jpg)
FAILURE The Bad Parts
![Page 4: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/4.jpg)
Ariane 5 - 4 June 1996
๏ 10 years of research
๏ $7 billion invested
๏ Exploded within a minute of take-off
๏ Loss estimate $370 million
๏ Why?
๏ Trying to stuff a 64-bit float into a 16-bit int
๏ o_O + wat
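The class of conversion named on this slide can be sketched in Scala. This is only an illustration, not the Ariane flight code: the `Narrowing` object and its method names are made up here. On the JVM an unchecked narrowing does not raise an error at all; it silently produces a different number, which is why a checked variant that reports the problem as a value is shown next to it.

```scala
object Narrowing {
  // Unchecked: the Double is truncated to an Int, then only the low 16 bits are kept.
  // Out-of-range input yields a wrong number instead of an error.
  def unchecked(value: Double): Short = value.toShort

  // Checked: out-of-range (or NaN) input is reported to the caller as a value.
  def checked(value: Double): Either[String, Short] =
    if (value.isNaN || value < Short.MinValue || value > Short.MaxValue)
      Left(s"$value does not fit in a 16-bit signed integer")
    else
      Right(value.toShort)
}
```

With this sketch, `Narrowing.checked(40000.0)` is a `Left`, while `Narrowing.unchecked(40000.0)` quietly returns a negative number.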
![Page 5: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/5.jpg)
“Failure is an option. A Some(failure) to be exact.”
– me
![Page 6: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/6.jpg)
Failure Recovery
![Page 7: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/7.jpg)
#define Failure
#undef Failure
![Page 8: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/8.jpg)
Software fails
![Page 9: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/9.jpg)
Runtime
๏VM (OpenJDK Issue Tracker)
๏OS
๏Drivers
๏Firmware
![Page 10: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/10.jpg)
Runtime
๏Overload/Exhaustion
๏Stack
๏Heap
๏FDs
๏…
๏Starvation
![Page 11: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/11.jpg)
Hardware fails
![Page 12: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/12.jpg)
CPUs
"Related instructions that are affected by the bug are
FDIVP, FDIVR, FDIVRP, FIDIV, FIDIVR, FPREM, and FPREM1.
The instructions FPTAN and FPATAN are also susceptible"
http://en.wikipedia.org/wiki/Pentium_FDIV_bug
![Page 13: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/13.jpg)
RAM
![Page 14: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/14.jpg)
DRAM Errors in the Wild: A Large-Scale Field Study
Bianca Schroeder Dept. of Computer Science
University of Toronto Toronto, Canada
Eduardo Pinheiro Google Inc.
Mountain View, CA
Wolf-Dietrich Weber Google Inc.
Mountain View, CA
![Page 15: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/15.jpg)
DRAM Errors in the wild
๏Memory errors were between 15 and 120 times (!) more common than had previously been assumed.
๏More than 90% of the problems with a given platform were caused by about 20% of the machines that had errors.
![Page 16: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/16.jpg)
DRAM Errors in the wild
(Credit: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber)
http://news.cnet.com/8301-30685_3-10370026-264.html
![Page 17: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/17.jpg)
DRAM Errors in the wild
๏Temperature didn't seem to make a big difference.
๏Irreparable problems were more common than transient problems.
๏Increased number of errors with age, setting in as early as 10-18 months in the field.
![Page 18: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/18.jpg)
HDDs
![Page 19: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/19.jpg)
Failure Trends in a Large Disk Drive Population
Eduardo Pinheiro, Wolf-Dietrich Weber and Luiz André Barroso
Google Inc. 1600 Amphitheatre Pkwy Mountain View, CA 94043
{edpin,wolf,luiz}@google.com
![Page 20: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/20.jpg)
Failure Trends by age
![Page 21: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/21.jpg)
Failure Trends by utilization and age
![Page 22: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/22.jpg)
The Network is Reliable
LOL
Kyle Kingsbury's blog:
http://aphyr.com/posts/288-the-network-is-reliable
![Page 23: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/23.jpg)
Wetware fails
![Page 24: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/24.jpg)
“An expert is a man who has made all the mistakes which can be made, in a narrow field.”
– Niels Bohr
![Page 25: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/25.jpg)
Assumptions are bad
![Page 26: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/26.jpg)
Quiz
val result = something(x,y)
![Page 27: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/27.jpg)
๏ Failure is unintentional
๏ Validation is intentional
Validation vs Failure
![Page 28: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/28.jpg)
Flows of information
๏ Results & Validation
๏ Failures & Recovery
๏ Don't complect them!
Attribution:
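A minimal sketch of keeping the two flows apart, assuming Scala 2.13 or later. All names (`Flows`, `parseAge`, `lookupDiscount`, the simulated `storeUp` flag) are hypothetical and only serve the illustration: validation outcomes travel as `Either` values, failures travel in `Try`.

```scala
import scala.util.Try

object Flows {
  // Validation: intentional and expected, so it is returned as a value.
  def parseAge(input: String): Either[String, Int] =
    input.trim.toIntOption match {
      case Some(n) if n >= 0 => Right(n)
      case Some(n)           => Left(s"age must not be negative: $n")
      case None              => Left(s"not a number: '$input'")
    }

  // Failure: unintentional; simulated here by a dependency that is down.
  def lookupDiscount(age: Int, storeUp: Boolean): Int =
    if (!storeUp) throw new IllegalStateException("discount store unreachable")
    else if (age >= 65) 20
    else 0

  // Try carries failures, the inner Either carries results and validation.
  def discountFor(input: String, storeUp: Boolean): Try[Either[String, Int]] =
    Try(parseAge(input).map(age => lookupDiscount(age, storeUp)))
}
```

A caller can show the `Left` to the user who typed the input, while whoever is responsible for recovery deals with the failed `Try`.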
![Page 29: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/29.jpg)
The Little
Vending Machine
That Could
![Page 30: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/30.jpg)
Failure Validation Handled
![Page 31: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/31.jpg)
Outcome awareness
Known-Unknowns Unknown-Unknowns
Known-Knowns Unknown-Knowns
![Page 32: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/32.jpg)
Failure awareness
Known-Unknowns Unknown-Unknowns
Known-Knowns Unknown-Knowns
![Page 33: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/33.jpg)
๏ Result
๏ Invalid input
๏ Illegal value
๏ Illegal value combination
๏ Capability/Dependency violation
๏ Nothing
๏ Uninvoked
๏ Response lost
Possibilities
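The possibilities listed on this slide can be written down as a closed set of types. This is a sketch under an assumption: the transcript has lost the slide's indentation, so the grouping below (three kinds of invalid input, two kinds of "nothing") is a guess, and every type name is hypothetical.

```scala
sealed trait Outcome[+A]

object Outcome {
  final case class Result[A](value: A) extends Outcome[A]

  // Invalid input: the callee ran and rejected what it was given.
  sealed trait Invalid extends Outcome[Nothing]
  final case class IllegalValue(field: String) extends Invalid
  final case class IllegalCombination(fields: List[String]) extends Invalid
  final case class CapabilityViolation(dependency: String) extends Invalid

  // Nothing: the caller sees no answer and cannot tell the two cases apart.
  sealed trait NoAnswer extends Outcome[Nothing]
  case object Uninvoked extends NoAnswer
  case object ResponseLost extends NoAnswer

  // Because the hierarchy is sealed, the compiler can warn about a forgotten case.
  def describe(outcome: Outcome[Any]): String = outcome match {
    case Result(value) => s"result: $value"
    case _: Invalid    => "invalid input: report back to the caller"
    case _: NoAnswer   => "no answer: retry or time out"
  }
}
```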
![Page 34: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/34.jpg)
“Program testing can be used to show the presence of bugs, but never to show their absence!”
– Edsger Dijkstra
![Page 35: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/35.jpg)
Testing & Checking
๏ Testing is good for
๏ Known-Knowns
๏ Checking is good for
๏ Unknown-Knowns
๏ Known-Unknowns
๏ Unknown-Unknowns
๏ Conclusion
๏Use both!
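The slide does not define "checking". Assuming it means property-style checking over many generated inputs (as tools such as ScalaCheck do), the difference from an example-based test could look like the hand-rolled sketch below. `clamp` and both helper names are invented for the illustration, and no checking library is used.

```scala
import scala.util.Random

object TestingAndChecking {
  def clamp(x: Int, min: Int, max: Int): Int = math.max(min, math.min(max, x))

  // Testing: one hand-picked input with a known answer (a known-known).
  def exampleTest(): Boolean = clamp(15, 0, 10) == 10

  // Checking: a property that has to hold for inputs nobody picked by hand.
  def propertyCheck(samples: Int, seed: Long): Boolean = {
    val random = new Random(seed)
    (1 to samples).forall { _ =>
      val clamped = clamp(random.nextInt(), 0, 10)
      clamped >= 0 && clamped <= 10
    }
  }
}
```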
![Page 36: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/36.jpg)
Quiz
val result = println(x,y)
![Page 37: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/37.jpg)
Death & Delay & Distributed Programs
๏ There is no apparent difference between death and delay in a distributed system
๏ "Distributed programming is all about retries and timeouts"
๏ Without distribution you'll always have a SPOF
๏ … but the more hardware you have, the higher the risk of failures
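"Retries and timeouts" can be sketched with the standard library alone. The `RetryWithTimeout` helper below is hypothetical, and it blocks the caller for simplicity, which a real system would likely avoid. Note that a timeout cannot tell a dead callee from a slow one, and a timed-out call may still have been executed, so the sketch assumes `call` is safe to repeat.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.duration.FiniteDuration
import scala.util.{Failure, Try}

object RetryWithTimeout {
  // Runs `call` up to `attempts` times, waiting at most `timeout` per attempt.
  // Both a timeout and a failed Future count as a failed attempt.
  def retry[A](attempts: Int, timeout: FiniteDuration)(call: () => Future[A]): Try[A] = {
    require(attempts >= 1, "need at least one attempt")
    Try(Await.result(call(), timeout)) match {
      case Failure(_) if attempts > 1 => retry(attempts - 1, timeout)(call)
      case outcome                    => outcome
    }
  }
}
```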
![Page 38: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/38.jpg)
“A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.”
– Leslie Lamport
![Page 39: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/39.jpg)
Traditional Blocking RPC
๏What if: Request is lost
๏What if: Response is lost
๏Caller is held hostage by the Callee
๏… Stockholm Syndrome anyone?
http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf
![Page 40: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/40.jpg)
Defensive programming
๏ "Paranoid programming"
๏ Mixes concerns
๏ Unclear responsibilities
๏ At best gives a false sense of security
๏ Yields systems that fail extraordinarily
![Page 41: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/41.jpg)
try {
  val breakfast =
    try { prepare(new Breakfast) }
    catch { case ex: OutOfJamError => … }
    finally { … }
  eat(breakfast)
} catch {
  case ex: BreakfastOverflowError => …
} finally {
  …
}
![Page 42: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/42.jpg)
Yes We Can
Make Failure
Management Fun
![Page 43: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/43.jpg)
Distribution
![Page 44: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/44.jpg)
Replication & Failover
![Page 45: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/45.jpg)
CircuitBreakers
![Page 46: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/46.jpg)
CircuitBreakers
๏Benefits
๏Relieves pressure on failing parts
๏Are self-healing
๏Can be operated manually
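The three benefits can be seen in a deliberately small sketch. This is not Akka's `CircuitBreaker` API: the class, its parameters and the injectable clock `now` are made up for the illustration, and the code is not thread-safe.

```scala
import scala.util.{Failure, Success, Try}

final class CircuitBreaker(
    maxFailures: Int,
    resetAfterMillis: Long,
    now: () => Long = () => System.currentTimeMillis()) {

  private var failures = 0
  private var openedAt: Option[Long] = None

  // Open only until the reset timeout has passed; after that a call is let through.
  def isOpen: Boolean = openedAt.exists(opened => now() - opened < resetAfterMillis)

  // Manual operation.
  def trip(): Unit = { openedAt = Some(now()) }
  def reset(): Unit = { failures = 0; openedAt = None }

  def call[A](body: => A): Try[A] =
    if (isOpen) Failure(new IllegalStateException("circuit open: failing fast"))
    else Try(body) match {
      case ok @ Success(_) =>
        reset() // self-healing: one success closes the breaker again
        ok
      case ko @ Failure(_) =>
        failures += 1
        if (failures >= maxFailures) trip() // relieve pressure on the failing part
        ko
    }
}
```

While the breaker is open, `body` is never evaluated, so the failing part gets no traffic until the timeout has elapsed.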
![Page 47: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/47.jpg)
Supervisors
๏ Components dealing with the failure of subcomponents
๏ Decouples failure from validation
๏ Makes it obvious who is responsible for what
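A minimal sketch of the idea without any actor library. Everything here is hypothetical (`Supervision`, `Directive`, `supervise`), and Akka's actual supervision API looks different. The service only produces results and validation errors; the supervisor alone decides what happens when the service blows up.

```scala
import scala.util.{Failure, Try}

object Supervision {
  // What the supervisor may decide when its child fails.
  sealed trait Directive
  case object Restart extends Directive
  case object Stop extends Directive

  // A service knows about results and validation only.
  type Service[In, Out] = In => Either[String, Out]

  // Creates a fresh child per attempt; on an exception, `decide` chooses
  // between restarting (up to `maxRestarts` times) and giving up.
  def supervise[In, Out](newChild: () => Service[In, Out], maxRestarts: Int)(
      decide: Throwable => Directive)(input: In): Try[Either[String, Out]] = {
    def loop(restartsLeft: Int): Try[Either[String, Out]] =
      Try(newChild()(input)) match {
        case Failure(error) if restartsLeft > 0 && decide(error) == Restart =>
          loop(restartsLeft - 1)
        case outcome => outcome
      }
    loop(maxRestarts)
  }
}
```

Validation errors (`Left`) pass straight through to the caller and never trigger a restart.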
![Page 48: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/48.jpg)
Service
Supervisor
Input
Result/Validation
Failures / Recovery
Supervisors
![Page 49: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/49.jpg)
“Quis custodiet ipsos custodes?”
– Decimus Iunius Iuvenalis
Supervision
![Page 50: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/50.jpg)
Bulkheading
๏Compartmentalization
๏Prevent failures from cascading
๏Plays well with redundancy & failover
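One common way to compartmentalize on the JVM is a separate, bounded thread pool per concern. The sketch below assumes two made-up compartments, "reports" and "checkout"; the helper names are hypothetical. Exhausting the first pool does not take threads away from the second.

```scala
import java.util.concurrent.{CountDownLatch, Executors}
import scala.concurrent.{Await, ExecutionContext, ExecutionContextExecutorService, Future}
import scala.concurrent.duration._

object Bulkheads {
  // One bounded pool per compartment.
  def compartment(threads: Int): ExecutionContextExecutorService =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(threads))

  def demo(): String = {
    val reports  = compartment(1)
    val checkout = compartment(1)
    val stuck    = new CountDownLatch(1)
    try {
      // Occupy the only "reports" thread with a task that does not finish...
      Future(stuck.await())(reports)
      // ...while "checkout" still answers from its own pool.
      Await.result(Future("order accepted")(checkout), 2.seconds)
    } finally {
      stuck.countDown() // release the stuck task so both pools can shut down
      reports.shutdown()
      checkout.shutdown()
    }
  }
}
```

Had both tasks shared a single one-thread pool, the second would have waited behind the first and timed out.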
![Page 51: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/51.jpg)
“An escalator can never break: it can only become stairs. You should never see an Escalator Temporarily Out Of Order sign, just Escalator Temporarily Stairs. Sorry for the convenience.”
– Mitch Hedberg
Graceful degradation
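In code, "becoming stairs" usually means falling back to something simpler that always works. A minimal sketch with a hypothetical shop page: personalised recommendations are nice to have, a static bestseller list is the stairs.

```scala
import scala.util.Try

object Degradation {
  // The fallback needs no remote call, so it is always available.
  val bestsellers: List[String] = List("coffee", "tea")

  // If the personalised lookup fails, serve the simpler answer instead of an error page.
  def recommendations(fetchPersonalised: () => List[String]): List[String] =
    Try(fetchPersonalised()).getOrElse(bestsellers)
}
```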
![Page 52: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/52.jpg)
My crystal ball
![Page 53: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/53.jpg)
Microservices
๏ Do one thing well
๏ Concurrent & Compartmentalized
๏ Location transparent
๏ Typed endpoints producing typed streams of data
๏ Exhibit compositionality
๏ Are async and non-blocking
๏ Support backpressure & flow control
![Page 54: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/54.jpg)
Summary
๏Failure management
๏… is not Validation
๏… need not be boring
๏… is not optional
๏There are real consequences
๏… and there are ways to avoid them!
![Page 55: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/55.jpg)
“Don't worry, be happy.”
– Bobby McFerrin
Attribution: Steve Jurvetson
![Page 56: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/56.jpg)
Thank you!
๏ @viktorklang on Twitter
๏ Want to know more?
๏ http://akka.io
๏ http://typesafe.com
๏ http://reactivemanifesto.org
![Page 57: Failure the-good-parts](https://reader035.vdocuments.mx/reader035/viewer/2022081519/554fb0bbb4c905ad218b5221/html5/thumbnails/57.jpg)
End of transmission…