Fault Tolerance in Distributed Systems
![Page 1: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/1.jpg)
Fault Tolerance in Distributed Systems
Gökay Burak AKKUŞ
Cmpe516 – Fault Tolerant Computing
![Page 2: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/2.jpg)
Distributed Systems
Main focus on service-based systems: Web Services, Grid Computing, ...
![Page 3: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/3.jpg)
Service Orientation
Services are written in diverse programming languages, run on diverse platforms, and span organisational boundaries.
Service-Oriented Architectures (SOA): Web Services, Grid Computing.
SOA is an architectural model that emphasises interoperability and location transparency.
An SOA is a collection of services; each service can be considered a resource that is either provided or consumed.
![Page 4: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/4.jpg)
Dependability
Dependability is a collective term that encompasses reliability, performance, maintainability, and security.
Reliability is the part of dependability concerned with the probability that a given system will behave according to its requirements.
![Page 5: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/5.jpg)
SOAs
SOAs support the development and integration of complex systems by representing software functionality as discoverable services on a network.
A traditional way to increase the dependability of distributed systems is through the use of fault tolerance techniques.
![Page 6: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/6.jpg)
The approach of design diversity: Multi-Version Design (MVD)
It draws on the availability of multiple functionally equivalent services.
![Page 7: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/7.jpg)
Comparison
Three systems are compared: a single-version system, a traditional MVD system, and a provenance-aware MVD system.
![Page 8: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/8.jpg)
CMF
Common-mode failure (CMF) occurs when independent or non-independent faults lead to similar errors between versions of an MVD system.
If one of the shared services fails, the failure may propagate back to the calling services.
![Page 9: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/9.jpg)
Such failures are a "worst case" scenario in a fault-tolerant system, as they may pass through the system undetected.
It is often safer to return no result, and to alert an operator and/or place the system in a safe state, than to allow an undetected error to occur.
![Page 10: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/10.jpg)
![Page 11: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/11.jpg)
CMF caused by the failure of a shared service reduces the confidence that can be placed in the results of design-diversity-based fault tolerance schemes.
Provenance is introduced as a solution to this problem.
![Page 12: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/12.jpg)
Provenance
The provenance of a piece of data is the documentation of the process that led to that data.
Provenance can be used to verify a process, to reproduce a process, and to provide context for a piece of result data.
![Page 13: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/13.jpg)
Provenance in the context of SOAs
Interaction provenance: for some data, interaction provenance is the documentation of the interactions between actors that led to the data.
Actor provenance: for some data, actor provenance is documentation that can only be provided by a particular actor, pertaining to the process that led to the data.
In a workflow-based SOA, interaction provenance provides a record of the invocations of all the services used in a given workflow, including the input and output data of each invoked service.
![Page 14: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/14.jpg)
Usage of provenance
Through an analysis of interaction provenance, patterns in workflow execution can be detected
Knowing whether a common service was invoked by several other services in a workflow lets a fault tolerance algorithm check whether faults in the workflow stem from the misbehaviour of a single service.
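The analysis described above can be illustrated with a minimal sketch. It assumes the interaction provenance has already been summarised as a mapping from each channel to the services its workflow invoked; the function names and the sample workflows (ID1 to ID3, ER1, ER3, TL1, TL2) are hypothetical and not taken from the original system.

```python
from collections import defaultdict

def service_callers(channel_workflows):
    """Map each service to the set of channels whose workflow invoked it."""
    invoked_by = defaultdict(set)
    for channel, services in channel_workflows.items():
        for service in services:
            invoked_by[service].add(channel)
    return dict(invoked_by)

def shared_services(channel_workflows):
    """Return only the services invoked by more than one channel."""
    return {
        service: channels
        for service, channels in service_callers(channel_workflows).items()
        if len(channels) > 1
    }

# Hypothetical provenance summary: channel name -> services it invoked.
workflows = {
    "ID1": ["ER1", "TL1"],
    "ID2": ["ER3", "TL1"],
    "ID3": ["ER3", "TL2"],
}
shared = shared_services(workflows)
```

Here `shared` flags TL1 and ER3 as dependencies common to two channels each, which is the kind of information a voter needs in order to suspect a common-mode failure.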
![Page 15: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/15.jpg)
Provenance provides a picture of a system's current and past operational state, which can be used to isolate and detect faults
The proposed scheme votes on the results of functionally equivalent services in order to mask the faults of the fault model (next slide).
![Page 16: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/16.jpg)
![Page 17: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/17.jpg)
![Page 18: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/18.jpg)
PReServ (Provenance Recording for Services)
PReServ is a Java-based Web Services implementation of the Provenance Recording Protocol.
It makes an SOA provenance-aware by using three components:
A provenance store that stores provenance and allows it to be queried.
A client-side library for communicating with the provenance store.
A handler for the Apache Axis Web Service container that automatically records interaction provenance for Axis-based services and clients by recording incoming and outgoing SOAP messages in a specified provenance store.
![Page 19: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/19.jpg)
MVD system
A service i invokes k services in its workflow.
A counter Ck stores the number of times a service k is invoked by MVD channel workflows in the system.
If i produces a result that agrees with the consensus result, then every Sk in that service's workflow is increased by one; otherwise Sk is set to 0.
The weighting of each service k is then calculated from these counters.
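The counter bookkeeping on this slide can be sketched as follows. This is a minimal sketch rather than the authors' implementation: the class and method names are hypothetical, and the sketch stops at maintaining Ck and Sk because the weighting formula itself is not given in the text.

```python
from collections import defaultdict

class ServiceCounters:
    """Tracks Ck (invocations) and Sk (agreement counter) per service k."""

    def __init__(self):
        self.invocations = defaultdict(int)  # Ck
        self.agreements = defaultdict(int)   # Sk

    def record(self, workflow_services, agreed_with_consensus):
        """Update the counters for every service in one channel's workflow."""
        for k in workflow_services:
            self.invocations[k] += 1
            if agreed_with_consensus:
                self.agreements[k] += 1
            else:
                # As stated on the slide, a disagreement resets Sk to 0.
                self.agreements[k] = 0

# Hypothetical run: one channel agrees with the consensus, one does not.
counters = ServiceCounters()
counters.record(["ER1", "TL1"], agreed_with_consensus=True)
counters.record(["ER3", "TL1"], agreed_with_consensus=False)
```

After these two calls TL1 has been invoked twice but its Sk is back at 0, because the second channel that used it disagreed with the consensus.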
![Page 20: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/20.jpg)
Voting
The FT-Grid system is used for voting.
Based on the weightings, results are eliminated from the vote.
User-defined values are also taken into account in the voting process.
![Page 21: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/21.jpg)
If a service k1 has a degree of 1, then only one MVD channel invokes that service.
If k1 has a degree of 2, then two MVD channels invoke it.
The weightings Sk are then biased based on user-defined settings.
Example: if a user specifies a bias of 0.95 for a service with a degree of 2, then the final weighting of a service i with a degree of 2 is Wi = Si * 0.95.
If any service within a given channel falls below a user-defined minimum weighting, then that channel is discarded from the voting process.
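The biasing and discard rule can be illustrated with a small sketch. Only the bias of 0.95 for degree 2 comes from the slide; the function names, the sample weightings, the minimum of 0.7, and the choice to leave degrees without a configured bias unchanged (bias 1.0) are assumptions made for illustration.

```python
def final_weighting(weighting, degree, bias_by_degree):
    """Wi = Si * bias for the service's degree (assumed 1.0 if unconfigured)."""
    return weighting * bias_by_degree.get(degree, 1.0)

def channel_is_kept(channel_services, weightings, degrees, bias_by_degree, minimum):
    """A channel is discarded if any of its services falls below the minimum."""
    return all(
        final_weighting(weightings[s], degrees[s], bias_by_degree) >= minimum
        for s in channel_services
    )

# Hypothetical inputs; only the 0.95 bias for degree 2 is from the slide.
weightings = {"ER1": 0.9, "ER3": 0.6, "TL1": 0.8}
degrees = {"ER1": 1, "ER3": 2, "TL1": 2}
bias = {2: 0.95}

keep_first = channel_is_kept(["ER1", "TL1"], weightings, degrees, bias, minimum=0.7)
keep_second = channel_is_kept(["ER3", "TL1"], weightings, degrees, bias, minimum=0.7)
```

With these numbers the first channel stays in the vote, while the second is discarded because the biased weighting of ER3 (0.6 * 0.95) is below the minimum.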
![Page 22: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/22.jpg)
Experiments
A total of 12 web services were developed and spread across 5 machines, using Apache Tomcat/Axis as the hosting environment, each with provenance functionality and each registered with a UDDI server.
5 "Import Duty" services were developed.
4 "Exchange Rate" services were developed.
3 "Tax Lookup" services were developed.
![Page 23: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/23.jpg)
![Page 24: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/24.jpg)
A design defect and/or malicious attack is simulated by perturbing code in two of the exchange rate services, ER3 and ER4.
This gives them a probability of failure (in this case, returning an incorrect value) of 0.33 and 0.5 respectively.
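This kind of fault injection can be sketched as a wrapper that occasionally returns a wrong value. The wrapper, the stand-in `exchange_rate` function, and the way the value is perturbed are hypothetical; only the failure probabilities 0.33 and 0.5 for ER3 and ER4 come from the slide.

```python
import random

def make_faulty(service, failure_probability, rng):
    """Wrap a service so that it returns an incorrect value with the given probability."""
    def wrapped(*args):
        correct = service(*args)
        if rng.random() < failure_probability:
            return correct + 1.0  # arbitrary perturbation standing in for a fault
        return correct
    return wrapped

def exchange_rate(amount, rate):
    """Hypothetical fault-free exchange rate computation."""
    return amount * rate

rng = random.Random(42)  # seeded so that runs are repeatable
er3 = make_faulty(exchange_rate, 0.33, rng)
er4 = make_faulty(exchange_rate, 0.5, rng)
```

Calling `er3(100.0, 1.5)` repeatedly should give a wrong answer in roughly a third of the calls, and `er4` in roughly half of them.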
![Page 25: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/25.jpg)
Applied Experiments
Experiment 1
Execute a single-version client-side application that invokes a random import duty service, passing it a randomly generated set of parameters.
The application then compares the result it receives against the fault-free local import duty service and logs whether or not a correct answer has been returned.
![Page 26: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/26.jpg)
Experiment 2
Execute a client-side MVD application with no provenance capability.
The application invokes all 5 import duty services and waits for the first three results to be returned.
The application discards the results of any import duty service whose weighting falls below a user-defined value, and performs consensus voting on the remaining results.
If no consensus is reached, or the number of channels to vote on is less than three, then the client waits for an additional MVD channel to return results, checks the channel's weighting to see whether it should be discarded, and then votes accordingly.
This continues until either consensus is reached or all 5 channels have been invoked.
The results are then compared against the fault-free local import duty service, as in Experiment 1.
![Page 27: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/27.jpg)
Experiment 3
Execute an MVD client-side application with provenance capability.
The client invokes all 5 import duty services and waits for the first three results to be returned.
It analyses the provenance records of these channels and discards the results of any channel that includes a service that falls below a minimum, user-defined weighting.
If no consensus is reached, or the number of channels to vote on is less than three, then the MVD application waits for an additional channel to return results, checks whether this channel should be discarded, and then votes accordingly.
This continues until either consensus is reached or all 5 channels have been invoked.
Results from the voter are then compared against the local fault-free import duty service.
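The voting loop used in the MVD experiments can be sketched as below. This is a sketch under stated assumptions, not the FT-Grid implementation: "consensus" is taken to mean a value returned by a strict majority of the channels still in the vote, results are modelled as a list in arrival order, and `channel_ok` is a hypothetical predicate standing in for the weighting check. The channel names and values are invented.

```python
from collections import Counter

def consensus(values):
    """Return the value held by a strict majority of the voters, or None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None

def vote(arrivals, channel_ok, initial=3, minimum_voters=3):
    """arrivals: (channel, result) pairs in the order the results return."""
    voters = []
    for position, (channel, result) in enumerate(arrivals, start=1):
        if channel_ok(channel):          # discard low-weighted channels
            voters.append(result)
        if position < initial:
            continue                     # still waiting for the first three results
        if len(voters) >= minimum_voters:
            agreed = consensus(voters)
            if agreed is not None:
                return agreed
    return None                          # all channels used, no consensus

# Hypothetical run: ID2 and ID3 share a faulty service and agree on a wrong value.
arrivals = [("ID1", 10.0), ("ID2", 12.5), ("ID3", 12.5), ("ID4", 10.0), ("ID5", 10.0)]
without_provenance = vote(arrivals, channel_ok=lambda c: True)
with_provenance = vote(arrivals, channel_ok=lambda c: c not in {"ID2", "ID3"})
```

Without the discard step the first three results already produce a majority for the wrong value 12.5, which is a common-mode failure; once the two channels that share the faulty service are discarded, the voter waits for ID4 and ID5 and returns 10.0. Returning `None` when no consensus exists follows the earlier point that returning no result is often safer than passing on an undetected error.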
![Page 28: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/28.jpg)
Experimental Results
Each experiment iterates 1000 times.
Each experiment is repeated three times.
Test system:
Apache Tomcat 5.0.28
Web Services implemented using Apache Axis 1.1
5 dual 3 GHz Xeon processor machines
Fedora Core Linux 2
![Page 29: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/29.jpg)
Generation of Weightings
A history-based weighting scheme is used.
A client application similar to the provenance-aware MVD scheme is run.
History weightings are based on the consensus results of 1000 invocations of all five import duty services.
No logging or verification of results is performed.
![Page 30: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/30.jpg)
![Page 31: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/31.jpg)
![Page 32: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/32.jpg)
The weightings of ER3 and ER4 show significant deviations.
This is due to the faults that are injected into ER3 and ER4.
Based on these results, minimum acceptable weightings are set.
![Page 33: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/33.jpg)
![Page 34: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/34.jpg)
Experiment 1: Single-version system with no provenance capability
1000 tests on a random import duty service
164 incorrect results: 16.4% undetected incorrect results
Time for the UDDI query of an import duty service: 279.72 ms
Total time until a result: 3895 ms
![Page 35: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/35.jpg)
![Page 36: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/36.jpg)
Common-mode failures are frequent.
Each channel has approximately the same weighting value, as there is no provenance data, so unreliable channels are not discarded from voting.
Total time for a result: 4842 ms, about 1 second longer than in Experiment 1.
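The reported figures can be checked with a little arithmetic. The sketch below assumes that "longer" compares the 4842 ms total with the 3895 ms total of the single-version system in Experiment 1; the function names are hypothetical.

```python
def failure_rate_percent(incorrect, total):
    """Share of incorrect results, as a percentage."""
    return 100 * incorrect / total

def overhead(baseline_ms, measured_ms):
    """Return the extra time in ms and the relative increase in percent."""
    extra = measured_ms - baseline_ms
    return extra, 100 * extra / baseline_ms

single_version_rate = failure_rate_percent(164, 1000)   # Experiment 1
extra_ms, extra_pct = overhead(3895, 4842)               # Experiment 1 vs. traditional MVD
print(f"{single_version_rate:.1f}% incorrect, {extra_ms} ms extra ({extra_pct:.1f}% longer)")
```

Under that assumption the traditional MVD client takes 947 ms longer per result, which is the "about 1 second" quoted on the slide, or roughly a 24% increase.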
![Page 37: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/37.jpg)
![Page 38: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/38.jpg)
MVD system with provenance capability
Not a single common-mode failure occurs.
Timing: approximately the same as in Experiment 2.
![Page 39: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/39.jpg)
![Page 40: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/40.jpg)
Conclusion
Solutions for the provision of dependability in service-oriented architectures are needed.
Approach: extend the concept of design-diversity-based fault tolerance schemes (such as multi-version design) to the service-oriented paradigm.
Leverage the benefits of SOAs in order to produce cheaper MVD systems than has traditionally been the case.
Problem: without knowledge of the workflow of the services that form channels within the MVD system, the potential arises for multiple channels to depend on the same service.
This leads to an increased incidence of common-mode failure.
![Page 41: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/41.jpg)
Conclusion
The use of provenance to analyse a service's workflow is proposed.
An initial scheme that uses provenance to calculate weightings of channels within an MVD system, based on their workflow, is detailed.
A system is implemented to demonstrate the effectiveness of the scheme.
Three different client applications are used to test the approach:
Single-version system: fails on 16.4% of test iterations
Traditional MVD fault tolerance: fails on 7.6% of test iterations
Provenance-aware MVD scheme: failure rate of 0.6%
The provenance-aware scheme is more dependable, with no common-mode failures occurring and negligible performance overhead.
![Page 42: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/42.jpg)
Finally
This paper:
Details the potential for provenance data to be used during the voting process of an MVD scheme.
Implements an initial proof of concept for the approach.
Future work will include:
Investigating how to obtain QoS indicators from the metadata of each service in an MVD channel's workflow (facilitated through actor provenance) and applying these to the weighting algorithm.
Investigating the relationship between shared components and common-mode failure in more detail (to tune the voting scheme more finely).
![Page 43: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/43.jpg)
References
A Provenance-Aware Weighted Fault Tolerance Scheme for Service Based Applications, 2005
FT-Grid: A Fault-Tolerance System for e-Science, 2005
![Page 44: Fault Tolerance in Distributed Systems](https://reader035.vdocuments.mx/reader035/viewer/2022062321/56813c92550346895da63fce/html5/thumbnails/44.jpg)
Questions?