Reliable Distributed Systems

DESCRIPTION

Reliable Distributed Systems: How and Why Complex Systems Fail. We've talked about transactional reliability, and we've mentioned replication for high availability. But does this give us "fault-tolerant solutions"? How and why do real systems fail? - PowerPoint PPT Presentation

TRANSCRIPT

Reliable Distributed Systems
How and Why Complex Systems Fail

How and Why Systems Fail
- We've talked about transactional reliability
- And we've mentioned replication for high availability
- But does this give us "fault-tolerant solutions"?
- How and why do real systems fail?
- Do real systems offer the hooks we'll need to intervene?

Failure
- Failure is just one aspect of reliability, but it is clearly an important one
- To make a system fault-tolerant we need to understand how to detect failures and plan an appropriate response if a failure occurs
- This lecture focuses on how systems fail, how they can be "hardened", and what still fails after doing so

Systems can be built in many ways
- Reliability is not always a major goal when development first starts
- Most systems evolve over time, through incremental changes with some rewriting
- Most reliable systems are entirely rewritten using clean-room techniques after they reach a mature stage of development

Clean-room concept
- Based on the goal of using "best available" practice
- Requires good specifications
- Design reviews in teams
- Actual software also reviewed for correctness
- Extensive stress testing and code-coverage testing, using tools like "Purify"
- Use of formal proof tools where practical

But systems still fail!
- Gray studied failures in Tandem systems
- Hardware was fault-tolerant and rarely caused failures
- Software bugs, environmental factors, human factors (user error), and incorrect specifications were all major sources of failure

Bohrbugs and Heisenbugs
- Classification proposed by Bruce Lindsay
- Bohrbug: like the Bohr model of the atom: solid, easily reproduced; you can track it down and fix it
- Heisenbug: named for Heisenberg's uncertainty principle: a diffuse cloud, very hard to pin down and hence fix
- Anita Borr and others have studied life-cycle bugs in complex software using this classification

Programmer-facing bugs
- A Bohrbug is solid, easy to recognize and fix
- A Heisenbug is fuzzy, hard to find and fix

Lifecycle of a Bohrbug
- Usually introduced in some form of code change, or in the original design
- Often detected during thorough testing
- Once seen, easily fixed
- Bohrbugs remain a problem over the life-cycle of the software because of the need to extend the system or to correct other bugs
- The same input will reliably trigger the bug!

Lifecycle of a Bohrbug
- A Bohrbug is boring.

Lifecycle of a Heisenbug
- These are often side-effects of some other problem
- Example: a bug corrupts a data structure or misuses a pointer. The damage is not noticed right away, but causes a crash much later, when the structure is referenced (a toy sketch follows)
- Attempting to detect the bug may shift the memory layout enough to change its symptoms!
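
The delayed-symptom pattern can be sketched in a few lines of Python. There are no raw pointers here, but the shape is the same: the damage is done silently in one place and the crash appears much later, somewhere else. The class and names are invented for illustration.

```python
class OrderBook:
    def __init__(self):
        self.orders = {}                      # order_id -> (item, quantity)

    def add(self, order_id, item, quantity):
        self.orders[order_id] = (item, quantity)

    def buggy_cancel(self, order_id):
        self.orders[order_id] = None          # bug: corrupts the entry, no crash yet

    def total_quantity(self):
        return sum(qty for _, qty in self.orders.values())   # blows up on None

book = OrderBook()
book.add(1, "widget", 10)
book.buggy_cancel(1)        # damage done silently
# ... much later, in unrelated code ...
book.total_quantity()       # TypeError far from the real bug
```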

How programmers fix a Bohrbug
- They develop a test scenario that triggers it
- Use a form of binary search to narrow in on it
- Pin down the bug and understand precisely what is wrong
- Correct the algorithm or the coding error
- Retest extensively to confirm that the bug is fixed

How they fix Heisenbugs
- They fix the symptom: periodically scan the structure that is usually corrupted and clean it up (a sketch of such a sweep follows)
- They add self-checking code (which may itself be a source of bugs)
- They develop theories of what is wrong and fix the theoretical problem, but lack a test to confirm that this eliminated the bug
- These bugs are extremely sensitive to event orders
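
A small sketch of the scan-and-repair / self-checking idea, assuming a dictionary whose values are supposed to be (item, quantity) pairs; the invariant and names are illustrative only.

```python
def sweep(orders):
    """Self-checking pass: remove entries that violate the structure's
    invariant (every value must be an (item, quantity) tuple)."""
    bad = [key for key, value in orders.items()
           if not (isinstance(value, tuple) and len(value) == 2)]
    for key in bad:
        del orders[key]      # papers over the corruption; the root cause remains
    return len(bad)          # how much damage was silently repaired

# Intended to be called from a periodic maintenance timer, e.g. once a second.
```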

Bug-free software is uncommon
- Heavily used software may become extremely reliable over its life (the C compiler rarely crashes; UNIX is pretty reliable by now)
- Large, complex systems depend upon so many components, many of them complex, that bug freedom is an unachievable goal
- Instead, adopt the view that bugs will happen and plan for them

Bugs in a typical distributed system
- Usual pattern: some component crashes or becomes partitioned away
- Other system components that depend on it freeze or crash too
- Chains of dependencies gradually cause more and more of the overall system to fail or freeze

Tools can help
- Everyone should use tools like "Purify" (detects stray pointers, uninitialized variables, and memory leaks)
- But these tools don't help at the level of a distributed system
- The benefit of a model, like transactions or virtual synchrony, is that the model simplifies the developer's task

Leslie Lamport
- "A distributed system is one in which the failure of a machine you have never heard of can cause your own machine to become unusable"
- The issue is dependency on critical components
- The notion is that the state and "health" of the system at site A are linked to the state and health at site B

Component Architectures Make it Worse
- Modern systems are structured using object-oriented component interfaces: CORBA, COM (or DCOM), Jini, XML
- In these systems, we create a web of dependencies between components
- Any faulty component could cripple the system!

Reminder: Networks versus Distributed Systems
- The network focus is on connectivity, but components are logically independent: a program fetches a file and operates on it, but the server is stateless and forgets the interaction. Less sophisticated, but more robust?
- The distributed systems focus is on the joint behavior of a set of logically related components. We can talk about "the system" as an entity. But this needs fancier failure handling!

Component Systems?
- Includes CORBA and Web Services
- These are distributed in the sense of our definition
- Often, they share state between components
- If a component fails, replacing it with a new version may be hard
- Replicating the state of a component: an appealing option…
- Deceptively appealing, as we'll see

Thought question
- Suppose that a distributed system was built by interconnecting a set of extremely reliable components running on fault-tolerant hardware
- Would such a system be expected to be reliable?

Thought question
- Suppose that a distributed system was built by interconnecting a set of extremely reliable components running on fault-tolerant hardware
- Would such a system be expected to be reliable?
- Perhaps not. The pattern of interaction, the need to match rates of data production and consumption, and other "distributed" factors can all prevent a system from operating correctly!

Example
- The Web: its components are individually reliable
- But the Web can fail by returning inconsistent or stale data, can freeze up or claim that a server is not responding (even if both browser and server are operational), and it can be so slow that we consider it faulty even though it is working
- For stateful systems (the Web is stateless) this issue extends to the joint behavior of sets of programs

Example
- The Ariane rocket is designed in a modular fashion:
  - Guidance system
  - Flight telemetry
  - Rocket engine control
  - …etc
- When some rocket components were upgraded in a new model, working modules failed because hidden assumptions were invalidated (see the sketch below)
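
The widely reported root cause of the Ariane 5 maiden-flight failure was an unhandled overflow when a 64-bit floating-point value was converted to a 16-bit signed integer in inertial-reference code reused from the earlier vehicle, where the value could never grow that large. A toy sketch of that kind of hidden assumption; the numbers and names are illustrative, not taken from the flight software.

```python
import struct

def pack_reading(value: float) -> bytes:
    """Convert a sensor reading to a 16-bit signed integer, an assumption
    that held on the older, slower vehicle."""
    return struct.pack(">h", int(value))    # raises struct.error on overflow

pack_reading(12_000.0)    # old flight profile: always fits in 16 bits
pack_reading(40_000.0)    # new, faster vehicle: exceeds 32767 -> struct.error
```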

Ariane Rocket
[Diagram, shown in three steps: modular components — Guidance, Thrust Control, Attitude Control, Accelerometer, Telemetry, Altitude; in the middle step an "Overflow!" marker appears at the Altitude component]

Insights?
- Correctness depends very much on the environment
- A component that is correct in setting A may be incorrect in setting B
- Components make hidden assumptions
- Perceived reliability is in part a matter of experience and comfort with a technology base and its limitations!

Detecting failure
- Not always necessary: there are ways to overcome failures that don't explicitly detect them
- But the situation is much easier with detectable faults
- Usual approach: a process does something to say "I am still alive"
- Absence of proof of liveness is taken as evidence of a failure

Example: pinging with timeouts
- Programs P and B are the primary and backup of a service
- Programs X, Y, Z are clients of the service
- All "ping" each other for liveness
- If a process doesn't respond to a few pings, consider it faulty (a minimal sketch of this style of detector follows)
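
A minimal sketch of the ping/timeout style of failure detection described above. The peer names, interval, and threshold are illustrative assumptions; a real detector would exchange pings over the network rather than via local calls.

```python
import time

SUSPECT_AFTER_MISSED = 3      # "a few pings"
PING_INTERVAL = 1.0           # seconds between pings

class FailureDetector:
    def __init__(self, peers):
        self.last_heard = {p: time.monotonic() for p in peers}

    def on_ping_reply(self, peer):
        """Record that we just heard from this peer."""
        self.last_heard[peer] = time.monotonic()

    def suspects(self):
        """Peers silent for several ping intervals are suspected faulty.
        They may only be slow or partitioned away, so this can be wrong."""
        limit = SUSPECT_AFTER_MISSED * PING_INTERVAL
        now = time.monotonic()
        return [p for p, t in self.last_heard.items() if now - t > limit]

# Primary P, backup B and clients X, Y, Z would each run one of these.
fd = FailureDetector(["P", "B", "X", "Y", "Z"])
fd.on_ping_reply("P")
print(fd.suspects())          # [] until some peer stays silent too long
```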

Consistent failure detection
- Impossible in an asynchronous network that can lose packets: partitioning can mimic failure
- The best option is to track membership, but few systems have GMS services
- Many real networks suffer from this problem, hence consistent detection is impossible "in practice" too!
- We can always detect failures if a risk of mistakes is acceptable

Component failure detection
- An even harder problem! Now we need to worry
  - about programs that fail
  - but also about modules that fail
- Unclear how to do this, or even how to tell
- Recall that RPC makes component use rather transparent…

Vogels: the Failure Investigator
- Argues that we would not consider someone to have died because they don't answer the phone
- The approach is to consult other data sources (sketched below):
  - The operating system where the process runs
  - Information about the status of network routing nodes
  - Can augment with application-specific solutions
- Won't detect a program that looks healthy but is actually not operating correctly
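
A rough sketch of the "consult other data sources" idea: before declaring a service dead because it missed pings, also ask the host operating system whether the process still exists and check whether the host is reachable at all. The pid, host, and port are hypothetical, and the OS probe only works from the same machine.

```python
import os
import socket

def process_alive(pid: int) -> bool:
    """Ask the local OS whether the process still exists (signal 0 = probe only)."""
    try:
        os.kill(pid, 0)
        return True
    except ProcessLookupError:
        return False
    except PermissionError:
        return True                      # exists, just owned by someone else

def host_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Crude network check: can we open a TCP connection to the host at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def investigate(missed_pings: bool, pid: int, host: str, port: int) -> str:
    if not missed_pings:
        return "healthy"
    if not host_reachable(host, port):
        return "host or network problem"          # could be a partition, not a crash
    if not process_alive(pid):
        return "process crashed"
    return "alive but unresponsive"               # slow, wedged, or misbehaving
```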

Further options: "Hot" button
- Usually implemented using shared memory
- The monitored program must periodically update a counter in a shared-memory region. It is designed to do this at some frequency, e.g. 10 times per second.
- The monitoring program polls the counter, perhaps 5 times per second. If the counter stops changing, it kills the "faulty" process and notifies others. (A sketch follows.)
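
A rough sketch of the "hot button" using a shared counter, with the slide's rates (updates about 10/s, polls about 5/s). The process structure and the kill action are illustrative, and a POSIX host is assumed.

```python
import multiprocessing as mp
import os
import signal
import time

def monitored(counter):
    """The watched process: bumps the shared counter ~10 times per second."""
    while True:
        with counter.get_lock():
            counter.value += 1          # "I am still alive"
        time.sleep(0.1)

def monitor(counter, pid):
    """The watchdog: polls ~5 times per second; if the counter stops moving,
    it kills the presumed-faulty process (and would notify others)."""
    last = -1
    while True:
        time.sleep(0.2)
        current = counter.value
        if current == last:
            os.kill(pid, signal.SIGKILL)
            print("killed unresponsive process", pid)
            return
        last = current

if __name__ == "__main__":
    counter = mp.Value("i", 0)          # the "hot button" in shared memory
    child = mp.Process(target=monitored, args=(counter,))
    child.start()
    # Runs until the child stops updating; to see it fire, suspend the child
    # from another shell (e.g. kill -STOP <pid>).
    monitor(counter, child.pid)
```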

Friedman's approach
- Used in a telecommunications co-processor mockup
- Can't wait for failures to be sensed, so his protocol reissues requests as soon as the reply seems late (a sketch of the idea follows)
- Detecting the failure becomes a background task; it must still happen soon enough that overhead isn't excessive and realtime response isn't impacted
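
One way to realize "reissue the request as soon as the reply seems late" is a hedged call: ask the primary first, and if the answer has not arrived within a small deadline, send the same request to the backup and accept whichever reply comes first. This is only a sketch of the idea, not Friedman's actual protocol; the deadline value and the assumption that the request is safe to execute twice are both illustrative.

```python
import concurrent.futures as cf

def call_with_hedge(request, primary, backup, deadline=0.05):
    """Send to the primary; if the reply seems late, reissue to the backup and
    take whichever answer arrives first."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(primary, request)
        try:
            return first.result(timeout=deadline)      # fast path
        except cf.TimeoutError:
            second = pool.submit(backup, request)      # reply is late: reissue now
            done, _ = cf.wait({first, second}, return_when=cf.FIRST_COMPLETED)
            return done.pop().result()
        # Note: leaving the with-block still waits for the slower call to finish.
```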

Broad picture?
- Distributed systems have many components, linked by chains of dependencies
- Failures are inevitable; hardware failures are less and less central to availability
- Inconsistency of failure detection will introduce inconsistency of behavior and could freeze the application

Suggested solution?
- Replace critical components with a group of components that can each act on behalf of the original one
- Develop a technology by which states can be kept consistent and processes in the system can agree on the status (operational/failed) of components
- Separate the handling of partitioning from the handling of isolated component failures, if possible

Suggested Solution
[Diagram: a program invoking the module it uses]

Suggested Solution
[Diagram: the module it uses is replaced by a group of replicas, kept consistent by transparent replication and reached via multicast]

Replication: the key technology
- Replicate critical components for availability
- Replicate critical data: like coherent caching
- Replicate critical system state: control information such as "I'll do X while you do Y"
- In the limit, replication and coordination are really the same problem

Basic issues with the approach
- We need to understand client-side software architectures better to appreciate the practical limitations on replacing a server with a group
- Sometimes, this simply isn't practical

Client-server issues
- Suppose that a client observes a failure during a request
- What should it do?

Client-server issues
[Diagram: the client's request to the server ends in a timeout]

Client-server issues
- What should the client do?
- There is no way to know if the request was finished
- We don't even know if the server really crashed
- But suppose it genuinely crashed…

Client-server issues
[Diagram: after the timeout, the client turns to the backup server]

Client-server issues
- What should the client "say" to the backup?
  - "Please check on the status of my last request"? But perhaps the backup has not yet finished the fault-handling protocol
  - Reissue the request? Not all requests are idempotent (see the sketch below)
- And what about any "cached" server state? Will it need to be refreshed?
- Worse still: what if the RPC throws an exception, e.g. a "demarshalling error"?
  - A risk if the failure breaks a stream connection
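
One standard way to make reissue safe for non-idempotent requests is to tag each request with a unique id and have the server remember the results of requests it has already executed. The sketch below is illustrative; the `completed` table is exactly the kind of state the primary and backup would somehow have to share, which is the hard part the lecture is pointing at.

```python
import uuid

completed = {}        # request_id -> result; would have to be shared with the backup

def execute(request_id, amount, balance):
    """Server-side debit that is safe to reissue: duplicates return the recorded result."""
    if request_id in completed:
        return completed[request_id], balance          # duplicate: no second debit
    balance -= amount                                  # the non-idempotent effect, done once
    result = {"ok": True, "balance": balance}
    completed[request_id] = result
    return result, balance

req_id = str(uuid.uuid4())                 # client picks the id before the first send
result, bal = execute(req_id, 10, 100)     # original attempt (maybe its reply was lost)
result, bal = execute(req_id, 10, bal)     # blind reissue after a timeout: balance stays 90
```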

Client-server issues
- The client is doing a request that might be disrupted by failure, and must catch this failure
- The client needs to:
  - Reconnect, and figure out who will take over
  - Wait until it knows about the crash
  - Remember that cached data may no longer be valid
  - Track down the outcome of pending requests
- Meanwhile it must synchronize with respect to any new requests that the application issues

Client-server issues
- This argues that we need to make server failure "transparent" to the client
- But in practice, doing so is hard
- Normally, this requires deterministic servers, but not many servers are deterministic
- The techniques are also very slow…

Client-server issues
- Transparency
  - On the client side, "nothing happens"
  - On the server side:
    - There may be a connection that the backup needs to take over
    - What if the server was in the middle of sending a request?
    - How can the backup exactly mimic the actions of the primary?

Other approaches to consider
- N-version programming: use more than one implementation to overcome software bugs
- Explicitly uses some form of group architecture:
  - We run multiple copies of the component
  - Compare their outputs and pick the majority (a minimal voting sketch follows)
- Could be identical copies, or separate versions; in the limit, each is coded by a different team!
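
A minimal sketch of the voting step. The three "versions" are trivial stand-ins invented for illustration; a real N-version system would also have to agree on input order and on what counts as equal outputs.

```python
from collections import Counter

def n_version_call(versions, *args):
    """Run every version on the same input and return the majority output."""
    outputs = [version(*args) for version in versions]
    answer, votes = Counter(outputs).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("versions disagree: no majority output")
    return answer

# Three hypothetical independently written implementations; the third is buggy.
v1 = lambda x: round(x ** 0.5, 6)
v2 = lambda x: round(pow(x, 0.5), 6)
v3 = lambda x: 0.0
print(n_version_call([v1, v2, v3], 2.0))   # 1.414214: the buggy version is outvoted
```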

Other approaches to consider
- Even with n-version programming, we get limited defense against bugs
  - Studies show that Bohrbugs will occur in all versions!
- For Heisenbugs we won't need multiple versions: running one version multiple times suffices, if the copies see different inputs or a different order of inputs

Logging and checkpoints
- Processes make periodic checkpoints and log the messages sent in between
- Roll back to a consistent set of checkpoints after a failure
- The technique is simple and the costs are low
- But the method must be used throughout the system and is limited to deterministic programs (everything in the system must satisfy this assumption)
- Consequence: useful in limited settings (a toy sketch follows)
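
A toy sketch of checkpointing plus message logging for a single deterministic process: snapshot the state occasionally, log the messages applied since, and on recovery restore the snapshot and replay the log in order; determinism is what makes the replay reconstruct the same state. Names are illustrative, and in a real system the checkpoint and log would live on stable storage.

```python
import copy

class Process:
    def __init__(self):
        self.state = {"count": 0}
        self.checkpoint = copy.deepcopy(self.state)    # would live on stable storage
        self.log = []                                  # messages applied since the checkpoint

    def deliver(self, msg):
        self.log.append(msg)                           # log first, then apply
        self.state["count"] += msg                     # deterministic update

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.state)
        self.log.clear()

    def recover(self):
        self.state = copy.deepcopy(self.checkpoint)    # restore the snapshot
        for msg in self.log:                           # replay in the original order
            self.state["count"] += msg

p = Process()
p.deliver(3)
p.take_checkpoint()
p.deliver(4)
p.state = {"count": -999}    # pretend the in-memory state was lost in a crash
p.recover()
print(p.state)               # {'count': 7}, the state just before the "crash"
```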

Byzantine approach
- Assumes that failures are arbitrary and may be malicious
- Uses groups of components that take actions by majority consensus only
- The protocols prove to be costly
  - 3t+1 components are needed to overcome t failures (so tolerating even one faulty component takes four replicas)
  - It takes a long time to agree on each action
- Currently employed mostly in security settings

Hard practical problem
- Suppose that a distributed system is built from standard components, with application-specific code added to customize behavior
- How can such a system be made reliable without rewriting everything from the ground up?
- We need a plug-and-play reliability solution
- If reliability increases complexity, will reliability technology actually make systems less reliable?