Failure Mode Assumptions and Assumption Coverage
David Powell
Fault-Tolerance
Key questions
– How may components fail?
  Prevention strategies
– At what rate may they fail?
  The amount of redundancy needed
– What are the important types of faults?
  Types of redundancy needed
– What is the relation between dependability, redundancy and faults?
  General FT design guidelines
An F-T Paradox/Dilemma
More faulty → more redundancy → more possibility of faults → ???
Solution: Some Key Steps
Classify, quantify and verify the assumptions
Types of Failures
Overview
Single-user service
– Service model
– Potential errors
Multiple-user service
– Service model
– Potential errors
Single-user Service Model
Service items: $s_i$, $i = 1, 2, \ldots$
Value of $s_i$: $vs_i$
Observation time of $s_i$: $ts_i$
Service model: $s_i = \langle vs_i, ts_i \rangle$
Observed by an omniscient observer
Correctness Model
Service item $s_i$ is correct iff
$(vs_i \in SV_i) \land (ts_i \in ST_i)$
$SV_i$ and $ST_i$ are respectively the specified sets of values and times for service item $s_i$
Potential Errors
Arbitrary value error: $s_i : vs_i \notin SV_i$
Noncode value error: $s_i : vs_i \notin CV$ ($CV$ defines a code)
Arbitrary timing error: $s_i : ts_i \notin ST_i$
Early timing error: $s_i : ts_i < \min(ST_i)$
Late timing error: $s_i : ts_i > \max(ST_i)$
Omission error: $s_i : ts_i = \infty$
Impromptu error: $s_i : (vs_i = \ldots) \land (ts_i = \ldots)$
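The error classes above can be sketched as a small classifier for one service item. This is only an illustration under stated assumptions: the specified time set ST_i is modelled as a closed interval, an omission is encoded as an infinite observation time, and the names `Spec` and `classify` are hypothetical. The impromptu error is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    values: frozenset  # SV_i, the specified values
    t_min: float       # min(ST_i); ST_i is assumed to be an interval
    t_max: float       # max(ST_i)


def classify(vs, ts, spec, code):
    """Return every error class that applies to the item <vs, ts>."""
    errors = set()
    if not (spec.t_min <= ts <= spec.t_max):
        errors.add("arbitrary timing")
    if ts < spec.t_min:
        errors.add("early timing")
    if ts > spec.t_max:
        errors.add("late timing")
    if ts == math.inf:
        # Assumed convention: an item that is never delivered has ts = infinity.
        errors.add("omission")
        return errors  # no value was delivered, so value checks do not apply
    if vs not in spec.values:
        errors.add("arbitrary value")
    if vs not in code:
        errors.add("noncode value")
    return errors


spec = Spec(values=frozenset({1, 2}), t_min=0.0, t_max=5.0)
code = frozenset({1, 2, 3})  # CV, the set of valid code words
print(classify(3, 2.0, spec, code))  # a wrong value that is still a code word
```

An empty result means the item is correct. Under these conventions an omission is also reported as a late and an arbitrary timing error, because an infinite time lies above max(ST_i).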
Multi-user Service Model
Service item $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$
Service model: $\langle vs_i(u), ts_i(u) \rangle$ for all $i, u$
New issue: "consistency"
Correctness Model
$vs_i(u)$: the value of service item $i$ on process $u$
$vs_i$: the value of service item $i$
$SV_i$: the set of specified values of service item $i$
$ts_i(u)$: the observation time of service item $i$ on process $u$
$ST_i(u)$: the range of specified observation times of service item $i$ on process $u$
uv: the time bound of related occurrences
Examples of Potential Errors
Consistent value error
Consistent timing error
Semi-consistent value error
Failure Mode Assumptions
Attempt to formalize the concept of an assumed failure mode
By assertions on the sequences of service items delivered by a component
Examples of Value Error Assertions
No value errors occur ($V_{none}$)
$\forall i,\ vs_i \in SV_i$
The only value errors that occur are noncode value errors ($V_n$)
$\forall i,\ (vs_i \in SV_i) \lor (vs_i \notin CV)$
Arbitrary value errors can occur ($V_{arb}$)
$\forall i,\ (vs_i \in SV_i) \lor (vs_i \notin SV_i)$
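As a rough sketch, the three value error assertions can be written as predicates over a finite observed sequence of service items. The function names and the representation (a list of delivered values paired with a list of specified value sets) are assumptions made for illustration, not part of the slides.

```python
def v_none(values, specs):
    """V_none: every delivered value is in its specified set SV_i."""
    return all(vs in sv for vs, sv in zip(values, specs))


def v_n(values, specs, code):
    """V_n: every value is either correct or detectably outside the code CV."""
    return all(vs in sv or vs not in code for vs, sv in zip(values, specs))


def v_arb(values, specs):
    """V_arb: a tautology, so it holds for any sequence whatsoever."""
    return all(vs in sv or vs not in sv for vs, sv in zip(values, specs))


specs = [frozenset({1}), frozenset({2})]  # SV_1, SV_2
code = frozenset({1, 2, 3})               # CV

print(v_none([1, 2], specs))      # all values correct
print(v_n([1, 99], specs, code))  # 99 is wrong but is not a code word
print(v_n([1, 3], specs, code))   # 3 is wrong and looks like a valid code word
```

Any sequence that satisfies `v_none` also satisfies `v_n`, and any sequence satisfies `v_arb`, which mirrors the ordering of the assertions from most to least restrictive.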
Examples of Timing Error Assertions
No timing errors occur ($T_{none}$)
The only timing errors are omission errors ($T_O$)
The only timing errors are late timing errors ($T_L$)
The only timing errors are early timing errors ($T_E$)
Arbitrary timing errors can occur ($T_{arb}$)
Permanent omission/crash ($T_p$)
Bounded omission degree ($T_{Bk}$)
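A minimal sketch of the first four timing assertions as predicates over an observed sequence follows. It assumes each specified time set is a closed interval `(lo, hi)` and that an omitted item is recorded with an infinite observation time. The function names are hypothetical, and the permanent-omission and bounded-omission-degree assertions are left out because the slides do not give their formal definitions.

```python
import math


def t_none(times, windows):
    """T_none: every item is observed inside its specified window."""
    return all(lo <= ts <= hi for ts, (lo, hi) in zip(times, windows))


def t_o(times, windows):
    """T_O: every item is on time or omitted (ts = infinity, by assumption)."""
    return all(lo <= ts <= hi or ts == math.inf
               for ts, (lo, hi) in zip(times, windows))


def t_l(times, windows):
    """T_L: every item is on time or late."""
    return all(lo <= ts <= hi or ts > hi
               for ts, (lo, hi) in zip(times, windows))


def t_e(times, windows):
    """T_E: every item is on time or early."""
    return all(lo <= ts <= hi or ts < lo
               for ts, (lo, hi) in zip(times, windows))


windows = [(0, 1), (1, 2), (2, 3)]
with_omission = [0.5, math.inf, 2.5]
print(t_none(with_omission, windows), t_o(with_omission, windows),
      t_l(with_omission, windows))
```

With the infinite-time convention, a sequence whose only errors are omissions also satisfies the late-timing assertion, since an infinite time exceeds any finite upper bound.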
Timing Error Implications
Failure Mode Assertions (FMA)
A complete FMA entails an assertion on errors occurring in both the value and time domains
By taking the Cartesian product of the two domains, we get a family of FMAs
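The Cartesian product can be enumerated directly. The sketch below uses only the example assertion labels listed in the slides, so the resulting family is illustrative rather than exhaustive; the label strings are just identifiers.

```python
from itertools import product

# Example assertions named in the slides (labels only).
VALUE_ASSERTIONS = ["V_none", "V_n", "V_arb"]
TIMING_ASSERTIONS = ["T_none", "T_O", "T_L", "T_E", "T_arb", "T_p", "T_Bk"]

# Each complete FMA pairs one value assertion with one timing assertion.
fma_family = list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))

print(len(fma_family))  # 3 value assertions x 7 timing assertions
print(fma_family[0], fma_family[-1])
```

The most restrictive pair is `("V_none", "T_none")` and the least restrictive is `("V_arb", "T_arb")`.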
FMA Implication Graph
So what?
The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity!
Assumption Coverage
Establishing a link between assumed component failure mode and system dependability
(The design of an FT system relies on the assumptions it makes)
(The dependability of an FT system is related to the failure modes it assumes)
Motivation
Components may fail
They may fail in a bad way, which leads to a violation of the assumptions of the system
The system, in turn, can fail
Question: to what degree can a component FMA prove to be true in the real system?
The Coverage of the Assumption
Definition
$P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$
$P(V_{arb} \land T_{arb}) = 1$
$P(V_{none} \land T_{none}) = 0$
Coverage of an FT system
$PS(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
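The product form hints at a trade-off: a stricter assumption may be easier to process correctly but is less likely to hold. The sketch below evaluates the formula for two designs. All numeric values are hypothetical and chosen only for illustration, and `system_coverage` is an assumed helper name.

```python
def system_coverage(p_processing: float, p_assumption: float) -> float:
    """PS(X) = Pr{correct error processing | X} * Pr{X | component failed}."""
    for p in (p_processing, p_assumption):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing * p_assumption


# Hypothetical design A: restrictive assumption, near-perfect error processing.
strict = system_coverage(p_processing=0.9999, p_assumption=0.95)

# Hypothetical design B: arbitrary failures assumed, so the assumption
# always holds (coverage 1), but error processing is harder.
arbitrary = system_coverage(p_processing=0.99, p_assumption=1.0)

print(strict, arbitrary)
```

With these made-up figures the design that assumes arbitrary failures ends up with the higher overall coverage, even though its error processing is less reliable.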
Influence of Assumption Coverage on System Dependability
A Case Study
The System
A system of n processors
Connected via unidirectional message-passing buses
Each processor carries out the same computation steps
The result of each processing step is communicated to all other processors
Each processor has a decision function (DF)
The DF is applied to the results received from other processors
…
Each processor and its associated bus is viewed as a single component
Fail-Silent Processor-bus
A fail-silent processor
– Only has semi-consistent value errors
– Always produces messages on time
– Or ceases to produce messages forever
– If a message is delivered to a processor, it is delivered to all processors with a consistent fixed delay
Fail-Consistent Processor-bus
Only semi-consistent value errors may occur
Faulty processors may send erroneous values
Consistent timing errors may occur
Fail-Uncontrolled Processor-bus
Arbitrary timing errors
Arbitrary value errors
Implications of Assumption Coverage
Failure mode relations
Coverage relations
Dependability Expressions From Markov Models
$r = e^{-\lambda t}$
$\lambda$ = failure rate
A Life-critical Application
System reliability objective: $R > 1 - 10^{-9}$ over 10 hours
Single processor reliability:
– $r = e^{-\lambda t}$
– $1/\lambda$ = 5 years
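A quick calculation with the figures above shows why a single processor cannot meet the objective on its own. The sketch assumes a year of 365 days; the variable names are illustrative.

```python
import math

HOURS_PER_YEAR = 365 * 24          # assumption: leap years ignored
mttf_hours = 5 * HOURS_PER_YEAR    # 1/lambda = 5 years
lam = 1.0 / mttf_hours             # failure rate per hour
mission_hours = 10.0

# r = exp(-lambda * t); expm1 gives 1 - r without cancellation error.
r = math.exp(-lam * mission_hours)
unreliability = -math.expm1(-lam * mission_hours)

target_unreliability = 1e-9        # from the objective R > 1 - 10^-9
print(f"r = {r:.6f}, 1 - r = {unreliability:.3e}")
print(f"shortfall factor = {unreliability / target_unreliability:.1e}")
```

The single-processor unreliability over 10 hours comes out around 2.3e-4, more than five orders of magnitude above the 1e-9 budget, which is what motivates redundancy and makes the coverage of the failure mode assumption matter.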
A Money-Critical Application
It is about the availability of the system rather than its reliability
Please look at the paper for more details
Unavailability vs. Coverage
Conclusion
A formalism for describing component failure modes
Multiplicity of value and timing errors
The notion of assumption coverage
The relation between dependability, availability and assumption coverage
Thank you