Failure Mode Assumptions and Assumption Coverage

David Powell


Post on 25-Feb-2016



TRANSCRIPT

Page 1: Failure Mode Assumptions and Assumption Coverage

Failure Mode Assumptions and Assumption Coverage

David Powell

Page 2: Failure Mode Assumptions and Assumption Coverage

Fault-Tolerance

Key questions
– How may components fail? → Prevention strategies
– At what rate may they fail? → The amount of redundancy needed
– What are the important types of faults? → Types of redundancy needed
– What is the relation between dependability, redundancy and faults? → General FT design guidelines

Page 3: Failure Mode Assumptions and Assumption Coverage

An F-T Paradox/Dilemma

More faulty → More redundancy → More possibility of faults → ???

Page 4: Failure Mode Assumptions and Assumption Coverage

Solution: Some Key Steps

Classify, quantify and verify the assumptions

Page 5: Failure Mode Assumptions and Assumption Coverage

Types of Failures

Page 6: Failure Mode Assumptions and Assumption Coverage

Overview

Single-user service
– Service Model
– Potential Errors

Multiple-user service
– Service Model
– Potential Errors

Page 7: Failure Mode Assumptions and Assumption Coverage

Single-user Service Model

Service items: $s_i$, $i = 1, 2, \ldots$

Values of $s_i$: $vs_i$

Observation time of $s_i$: $ts_i$

Service model: $s_i = \langle vs_i, ts_i \rangle$

An omniscient observer

Page 8: Failure Mode Assumptions and Assumption Coverage

Correctness Model

Service item $s_i$ is correct iff

$(vs_i \in SV_i) \land (ts_i \in ST_i)$

$SV_i$ and $ST_i$ are respectively the specified sets of values and times for service item $s_i$

Page 9: Failure Mode Assumptions and Assumption Coverage

Potential Errors

Arbitrary value error: $s_i$ : $vs_i \notin SV_i$
Noncode error: $s_i$ : $vs_i \notin CV$ (CV defines a code)
Arbitrary timing error: $s_i$ : $ts_i \notin ST_i$
Early timing error: $s_i$ : $ts_i < \min(ST_i)$
Late timing error: $s_i$ : $ts_i > \max(ST_i)$
Omission error: $s_i$ : $ts_i = \infty$
Impromptu error: $s_i$ : ($vs_i$ = ) ($ts_i$ = )
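The error definitions above can be sketched as a small classifier. This is only an illustration under stated assumptions: `ServiceItem` and `classify` are hypothetical names, $ST_i$ is treated as a closed interval given by its earliest and latest times, an omission is represented by an infinite observation time, and the impromptu case is left out.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceItem:
    value: int    # vs_i: the delivered value
    time: float   # ts_i: observation time; math.inf means never delivered


def classify(item, sv, st, cv):
    """Return the error labels that apply to one service item.

    sv: set of specified values SV_i
    st: (earliest, latest) bounds of ST_i, assumed to be a closed interval
    cv: set of valid code words CV
    """
    if item.time == math.inf:
        # Nothing was observed, so the value is not examined (assumption).
        return ["omission"]
    errors = []
    if item.value not in sv:
        # A noncode error is the special case where the value is not even a code word.
        errors.append("noncode value" if item.value not in cv else "arbitrary value")
    earliest, latest = st
    if item.time < earliest:
        errors.append("early timing")
    elif item.time > latest:
        errors.append("late timing")
    return errors
```

An empty list means the item satisfies the correctness model, that is, both the value and the time fall inside their specified sets.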

Page 10: Failure Mode Assumptions and Assumption Coverage

Multi-user Service Model

Service item $s_i = \{s_i(1), s_i(2), \ldots, s_i(n)\}$

Service model: $\langle vs_i(u), ts_i(u) \rangle$, for all $i$, $u$

New issue: "consistency"

Page 11: Failure Mode Assumptions and Assumption Coverage

Correctness Model

$vs_i(u)$: the value of service item $i$ on process $u$
$vs_i$: the value of service item $i$
$SV_i$: the set of specified values for service item $i$
$ts_i(u)$: the observation time of service item $i$ on process $u$
$ST_i(u)$: the range of specified observation times of service item $i$ on process $u$
uv: the time bound of related occurrences

Page 12: Failure Mode Assumptions and Assumption Coverage

Examples of Potential Errors

Consistent value error

Consistent timing error

Semi-consistent value error

Page 13: Failure Mode Assumptions and Assumption Coverage

Failure Mode Assumptions

An attempt to formalize the concept of an assumed failure mode
By assertions on the sequences of service items delivered by a component

Page 14: Failure Mode Assumptions and Assumption Coverage

Examples of Value Error Assertions

No value errors occur ($V_{none}$)

$\forall i,\ vs_i \in SV_i$

The only value errors that occur are noncode value errors ($V_n$)

$\forall i,\ (vs_i \in SV_i) \lor (vs_i \notin CV)$

Arbitrary value errors can occur ($V_{arb}$)

$\forall i,\ (vs_i \in SV_i) \lor (vs_i \notin SV_i)$
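As a rough sketch, the three value error assertions can be evaluated over an observed sequence of service items. The representation is an assumption made for illustration: a trace is a list of `(vs_i, SV_i)` pairs, `cv` is the set of code words, and the function names are hypothetical.

```python
def holds_v_none(trace):
    """V_none: every delivered value lies in its specified set."""
    return all(vs in sv for vs, sv in trace)


def holds_v_n(trace, cv):
    """V_n: any value outside its specified set is also outside the code."""
    return all(vs in sv or vs not in cv for vs, sv in trace)


def strongest_value_assertion(trace, cv):
    """Name the most restrictive value assertion the trace satisfies."""
    if holds_v_none(trace):
        return "V_none"
    if holds_v_n(trace, cv):
        return "V_n"
    # V_arb is a tautology, so it holds for every trace.
    return "V_arb"
```

Because $V_{arb}$ is always true, it places no restriction on the component; the more restrictive assertions are the ones a real component may violate.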

Page 15: Failure Mode Assumptions and Assumption Coverage

Examples of Timing Error Assertions

No timing error occurs ($T_{none}$)

The only timing errors are omission errors ($T_O$)

The only timing errors are late timing errors ($T_L$)

The only timing errors are early timing errors ($T_E$)

Arbitrary timing errors can occur ($T_{arb}$)

Permanent omission/crash ($T_P$)

Bounded omission degree ($T_{Bk}$)

Page 16: Failure Mode Assumptions and Assumption Coverage

Timing Error Implications

Page 17: Failure Mode Assumptions and Assumption Coverage

Failure Mode Assertions (FMA)

A complete FMA entails an assertion on errors occurring in both the value and time domains
By taking the Cartesian product of the two domains, we get a family of FMAs
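The Cartesian product can be enumerated directly. The sketch below uses only the assertion names listed on the preceding slides and treats the bounded omission degree as a single label, although it is really parameterized by k; `fma_family` is a hypothetical helper name.

```python
from itertools import product

# Assertion names as listed on the value and timing assertion slides.
VALUE_ASSERTIONS = ["V_none", "V_n", "V_arb"]
TIMING_ASSERTIONS = ["T_none", "T_O", "T_L", "T_E", "T_arb", "T_P", "T_Bk"]


def fma_family():
    """Pair every value assertion with every timing assertion."""
    return list(product(VALUE_ASSERTIONS, TIMING_ASSERTIONS))


for value_assertion, timing_assertion in fma_family():
    print(f"{value_assertion} and {timing_assertion}")
```

With three value assertions and seven timing assertions, the family has 21 members, ranging from ("V_none", "T_none") to ("V_arb", "T_arb").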

Page 18: Failure Mode Assumptions and Assumption Coverage

FMA Implication Graph

Page 19: Failure Mode Assumptions and Assumption Coverage

So what?

The FMA classification and implication graph can serve as a guideline for designing families of FT algorithms that can process errors of increasing severity!

Page 20: Failure Mode Assumptions and Assumption Coverage

Assumption Coverage

Establishing a link between the assumed component failure mode and system dependability
(The design of an FT system relies on the assumptions it makes)
(The dependability of an FT system is related to the failure mode it assumes)

Page 21: Failure Mode Assumptions and Assumption Coverage

Motivation

Components may fail
They may fail in a bad way → this leads to a violation of the system's assumptions
The system, in turn, can fail

Question: to what degree does a component FMA prove to be true in the real system?

Page 22: Failure Mode Assumptions and Assumption Coverage

The Coverage of the Assumption

Definition: $P(X) = \Pr\{X = \text{true} \mid \text{component failed}\}$

$P(V_{arb} \land T_{arb}) = 1$

$P(V_{none} \land T_{none}) = 0$

Page 23: Failure Mode Assumptions and Assumption Coverage

Coverage of an FT system

$P_S(X) = \Pr\{\text{correct error processing} \mid X = \text{true}\} \times \Pr\{X = \text{true} \mid \text{component failed}\}$
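The product above is simple to evaluate numerically. In the sketch below, `system_coverage` is a hypothetical helper and the probabilities in the usage example are made-up illustrative values, not figures from the presentation.

```python
def system_coverage(p_processing_given_x, p_x_given_failure):
    """Coverage of an FT system: the probability of correct error processing
    given that assertion X holds, times the assumption coverage of X."""
    for p in (p_processing_given_x, p_x_given_failure):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_processing_given_x * p_x_given_failure


# Illustrative numbers only: error processing succeeds with probability 0.999
# when X holds, and X holds for 99% of component failures.
print(system_coverage(0.999, 0.99))
```

The system coverage can never exceed the assumption coverage, so a restrictive assumption that real components often violate caps the benefit of even perfect error processing.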

Page 24: Failure Mode Assumptions and Assumption Coverage

Influence of Assumption Coverage on System Dependability

A Case Study

Page 25: Failure Mode Assumptions and Assumption Coverage

The System

A system of n processors
Connected via a unidirectional message-passing bus
Each processor carries out the same computation steps
The result of each processing step is communicated to all other processors
Each processor has a decision function (DF)
The DF is applied to the results received from other processors
…
Each processor and its associated bus is viewed as a single component

Page 26: Failure Mode Assumptions and Assumption Coverage

Fail-Silent Processor-Bus

A fail-silent processor
– Only has semi-consistent value errors
– Always produces messages on time
– Or ceases to produce messages forever
– If a message is delivered to a processor, it is delivered to all processors with a consistent fixed delay

Page 27: Failure Mode Assumptions and Assumption Coverage

Fail-Consistent Processor-Bus

Only semi-consistent value errors may occur
Faulty processors may send erroneous values
Consistent timing errors may occur

Page 28: Failure Mode Assumptions and Assumption Coverage

Fail-Uncontrolled Processor-Bus

Arbitrary timing errors
Arbitrary value errors

Page 29: Failure Mode Assumptions and Assumption Coverage

Implications of Assumption Coverage

Failure mode relations

Coverage relations

Page 30: Failure Mode Assumptions and Assumption Coverage

Dependability Expressions From Markov Models

$r = e^{-\lambda t}$

$\lambda$ = failure rate

Page 31: Failure Mode Assumptions and Assumption Coverage

A Life-Critical Application

System reliability objective: $R > 1 - 10^{-9}$ over 10 hours
Single processor reliability:
– $r = e^{-\lambda t}$
– $1/\lambda$ = 5 years

Page 32: Failure Mode Assumptions and Assumption Coverage
Page 33: Failure Mode Assumptions and Assumption Coverage

A Money-Critical Application

It is about the availability of the system rather than its reliability
Please look at the paper for more details

Page 34: Failure Mode Assumptions and Assumption Coverage

Unavailability vs. Coverage

Page 35: Failure Mode Assumptions and Assumption Coverage

Conclusion

A formalism for describing component failure modes
Multiplicity of value and timing errors
The notion of assumption coverage
The relation between dependability, availability and assumption coverage

Page 36: Failure Mode Assumptions and Assumption Coverage

Thank you