fault tolerance and computing

FAULT TOLERANCE

Fault tolerance “is the ability of a system to continue satisfactory operation in the presence of one or more non-simultaneously occurring hardware or software faults.”

• Required when the system performs a flight-critical or flight-essential function, as defined by
– FAR Part 25.1309: Equipment, Systems and Installations
– MIL-F-9490: Flight Control Systems - Design, Installation and Test of Piloted Aircraft, General Specification for

• In brief,
– FAR 25.1309: probability of failure for a flight-critical system of < 10^-9 per flight hour
– MIL-F-9490: probability of failure resulting in the loss of the aircraft of < 10^-7 per flight hour

Significance

Fault Tolerance Systems

• To meet the demanding performance requirements, an FTS
– uses both types of fault tolerance: hardware and software
– has the capability of automatic dynamic reconfiguration of the system.

• Depending on the level of criticality and the allowable probability of failure (POF), dual, triple or quadruple redundancy is used.

• Redundancy extends to all hardware elements, such as processors, sensors, actuators and data buses and to the software.
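
The link between redundancy level and allowable POF can be illustrated with a small calculation. This is only a sketch under strong assumptions that are not in the slides: channel failures are independent (no generic faults), fault detection is perfect, and the per-channel failure probability `p_channel` and the "k of n channels must work" thresholds are hypothetical values chosen for illustration.

```python
from math import comb

def system_failure_probability(n: int, k: int, p: float) -> float:
    """Probability that fewer than k of n independent channels still work.

    p is the failure probability of one channel over the interval considered.
    """
    # The system fails when the number of failed channels f exceeds n - k.
    return sum(
        comb(n, f) * p**f * (1 - p) ** (n - f)
        for f in range(n - k + 1, n + 1)
    )

p_channel = 1e-4  # hypothetical per-flight-hour failure probability of one channel

# (name, channels installed, channels that must keep working): assumed thresholds
for name, n, k in [("dual", 2, 1), ("triple", 3, 2), ("quadruple", 4, 2)]:
    print(f"{name:9s} {system_failure_probability(n, k, p_channel):.3e}")
```

Because the independence assumption ignores generic faults, figures like these are optimistic; that gap is part of the motivation for dissimilar redundancy.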

Redundant Elements ?

Similar
• Simplifies the design process
• Reduces the costs
• Reduces the programming & verification activities

Dissimilar
• Substantial increase in protection against generic faults

Despite cost & complexity, the dissimilar approach is usually chosen to further ensure meeting the system reliability goals.

Reconfiguration

• Is the dynamic reallocation of redundant elements by executive level software in response to failures or changes in the aircraft mission or condition.

• Reconfiguration and anonymity of the active elements to the user are considered basic attributes of an FTS.

• In case of failure, the failed unit is switched off line and its functions are assumed by a spare, fault free unit, or different fault detection schemes are invoked.
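
The switch-over described above might be sketched as follows. The `Unit` and `Executive` classes are hypothetical names invented for this illustration; a real executive would also have to handle mission or condition changes, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Unit:
    name: str
    healthy: bool = True

@dataclass
class Executive:
    """Hypothetical executive-level software managing active units and spares."""
    active: List[Unit]
    spares: List[Unit] = field(default_factory=list)

    def report_failure(self, failed: Unit) -> Optional[Unit]:
        """Switch the failed unit offline and promote a spare if one exists."""
        failed.healthy = False
        self.active.remove(failed)
        if self.spares:
            replacement = self.spares.pop(0)
            self.active.append(replacement)
            return replacement
        # No fault-free spare left: the caller must invoke a different
        # fault detection scheme with the units that remain.
        return None
```

Callers only see the `active` list, so which physical unit does the work stays anonymous to the user.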

Fault Detection Scheme

• Central to all FTS principles is fault detection, which identifies a fault.

• Approaches
– Replication (triple or higher) and voting
– Duplication and comparison
– Self-checking

Electronic Flight Control System Architecture

(most apt application of FTS)

Multicomputer Architecture for Fault Tolerance (MAFT)

MAFT Lane Architecture

Primary Flight Control System configurations

Triplication & voting

Replication and voting

• a highly fault-tolerant voting circuit compares the values from multiple processors computing the same parameter, and if one of the values does not agree with the others, the value is ignored and the processor that generated the suspect value is switched offline.

• Next,
– a replacement processor can be brought online,
– or the system can revert to a lower level of replication,
– or to the duplication and comparison mode of operation.

• The failed processor may, if so designed, then execute a self diagnostic check and, if no permanent faults are found, return to active status.
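
A minimal software sketch of the voting step, assuming three processors that report one numeric parameter. It uses mid-value (median) selection with a tolerance band, which is one common voter design and not necessarily the one intended in the slides; the processor names, values and `tolerance` are hypothetical.

```python
from statistics import median
from typing import Dict, Set, Tuple

def vote(values: Dict[str, float], tolerance: float) -> Tuple[float, Set[str]]:
    """Return the voted value and the processors whose values disagree with it.

    The median is taken as the voted value; any processor further than
    `tolerance` from it is flagged as suspect.
    """
    voted = median(values.values())
    suspects = {name for name, v in values.items() if abs(v - voted) > tolerance}
    return voted, suspects

# Three hypothetical processors computing the same parameter.
online = {"P1": 2.01, "P2": 2.00, "P3": 7.50}
voted, suspects = vote(online, tolerance=0.05)
for name in suspects:
    del online[name]  # the suspect processor is switched offline
```

With only two processors left online, a voter like this can no longer outvote a bad value, which is why the options listed above include falling back to duplication and comparison.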

Duplication & Comparison

• Two processors compare their outputs with each other, and if they do not agree, the pair of processors collectively drops offline and both begin self-diagnostic routines.

• If each processor passes its self check, it can return to the active state and pair with another processor, either its previous mate or another, and resume processing.
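
The key property here is that a disagreement shows that something is wrong but not which processor is wrong, so the self check has to settle it. A rough sketch follows, with processors modelled as plain functions; `good`, `drifted`, the tolerance and the test vectors are all hypothetical.

```python
from typing import Callable, Iterable, Optional, Tuple

Processor = Callable[[float], float]

def compare_pair(a: Processor, b: Processor, x: float,
                 tolerance: float = 1e-6) -> Optional[float]:
    """Run both processors on the same input; return None if they disagree."""
    ya, yb = a(x), b(x)
    if abs(ya - yb) > tolerance:
        return None  # the pair drops offline: neither output can be trusted
    return ya

def self_check(proc: Processor,
               test_vectors: Iterable[Tuple[float, float]],
               tolerance: float = 1e-6) -> bool:
    """Self-diagnostic: replay known input/output pairs through the processor."""
    return all(abs(proc(x) - y) <= tolerance for x, y in test_vectors)

good = lambda x: 2.0 * x            # hypothetical healthy processor
drifted = lambda x: 2.0 * x + 0.5   # hypothetical faulty processor
vectors = [(1.0, 2.0), (3.0, 6.0)]  # hypothetical known-good test vectors

if compare_pair(good, drifted, 3.0) is None:
    # Only processors that pass their own self check may be paired again.
    eligible = [p for p in (good, drifted) if self_check(p, vectors)]
```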

Self Checking pairs

• Simple in concept, yet demanding in application.

• A self-checking processor can detect an error within itself through reasonableness checks on its intermediate and/or final results, without reference to other processors.

• In case of error, it will simply switch itself off and may, if so programmed, automatically bring a spare processor online as a replacement.
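
A reasonableness check is typically a range or rate test on a computed value. The sketch below assumes a hypothetical control-surface command in degrees; the limits `limit` and `max_step` are invented for illustration.

```python
from typing import Optional

def reasonable(prev_cmd: float, new_cmd: float,
               limit: float = 25.0, max_step: float = 2.0) -> bool:
    """Range and rate checks on a hypothetical surface command (degrees)."""
    in_range = abs(new_cmd) <= limit                 # within physical travel
    rate_ok = abs(new_cmd - prev_cmd) <= max_step    # no implausible jump
    return in_range and rate_ok

class SelfCheckingProcessor:
    """Hypothetical processor that takes itself offline on a failed check."""

    def __init__(self) -> None:
        self.online = True
        self.last_cmd = 0.0

    def step(self, new_cmd: float) -> Optional[float]:
        if not self.online or not reasonable(self.last_cmd, new_cmd):
            self.online = False  # switch itself off; a spare could be started here
            return None
        self.last_cmd = new_cmd
        return new_cmd
```

The hard part in practice is choosing limits tight enough to catch faults but loose enough not to reject valid extreme commands, which is one reason the concept is demanding in application.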

Fault Tolerant Software

• Complements fault-tolerant hardware in flight-critical systems.

• Uses concepts parallel to those of fault-tolerant hardware, such as similar and dissimilar redundancy and standby sparing.

• Fault-tolerant software falls into three categories:
– Multi-version programming
– Recovery blocks
– Exception handlers

• The importance of beginning the software design process with a complete software specification cannot be overemphasized.

Multi-version / N-Version Programming

• Requires the development of two or more versions of a program that performs a specific function, each developed by a separate software team and possibly designed to run on a different processor.

• All versions accept common input from an executive-level program, which in turn compares the results of the different versions to detect faults.
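
As a rough sketch of the executive's role, the function below feeds one input to every version and accepts a result only if a strict majority agrees. The three versions of "sum of 1..n" are hypothetical stand-ins for independently developed programs, and exact-match majority voting is an assumed comparison rule.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

def run_n_versions(versions: Sequence[Callable[[int], int]], x: int) -> Optional[int]:
    """Give every version the same input and return the majority result."""
    results = [version(x) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    # Accept only a strict majority; otherwise report that no value is trusted.
    return value if count > len(results) // 2 else None

# Three hypothetical, independently written versions of the same function.
def v_loop(n: int) -> int:
    return sum(range(1, n + 1))

def v_formula(n: int) -> int:
    return n * (n + 1) // 2

def v_faulty(n: int) -> int:
    return n * (n - 1) // 2  # deliberate bug standing in for a software fault

answer = run_n_versions([v_loop, v_formula, v_faulty], 10)
```

With only two versions a disagreement can be detected but not resolved, which the sketch reports as `None`.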

Multi-version Implementation issues

• Need for specific comparison points such as inputs and output values

• Implementing comparison schemes in turn implies a need for synchronization among the versions, so that comparisons can be made without excessive delay.

• Designer must decide whether the versions are to be executed in parallel or sequentially.

– Trade-off is between minimum hardware and slower execution (sequential) or more hardware and maximum speed (parallel).
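
The two execution choices can be contrasted in a few lines. This is only an analogy: Python threads stand in for separate processors and do not give a real speed-up for CPU-bound code, and both function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Version = Callable[[int], int]

def run_sequential(versions: Sequence[Version], x: int) -> List[int]:
    """One processor: versions run one after another, so run times add up."""
    return [version(x) for version in versions]

def run_parallel(versions: Sequence[Version], x: int) -> List[int]:
    """One worker per version: faster, but needs as many workers as versions."""
    with ThreadPoolExecutor(max_workers=len(versions)) as pool:
        futures = [pool.submit(version, x) for version in versions]
        # Collecting every result is the synchronization point: the comparison
        # cannot start until the slowest version has finished.
        return [future.result() for future in futures]
```

Both functions return results in the same order, so the same comparison logic can follow either one.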

Recovery Blocks

• Acceptability checks are made on the results from a primary version of a program. If the results fail the acceptability checks, an alternative version of the program that is different from the primary version is invoked, and the process of computation and acceptability checks is repeated.

• If no alternative version produces an acceptable result, the software block is judged to have failed.
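
The control flow of a recovery block is short enough to sketch directly. The square-root example, the faulty `primary` and the acceptance tolerance are hypothetical choices made for illustration.

```python
import math
from typing import Callable, Optional, Sequence

def recovery_block(x: float,
                   versions: Sequence[Callable[[float], float]],
                   acceptable: Callable[[float, float], bool]) -> Optional[float]:
    """Try the primary version, then each alternative, until one is accepted."""
    for version in versions:
        result = version(x)
        if acceptable(x, result):
            return result
    return None  # no version was accepted: the block is judged to have failed

def primary(x: float) -> float:
    return x / 2.0  # hypothetical faulty primary version of a square root

def alternative(x: float) -> float:
    return math.sqrt(x)

def acceptable(x: float, r: float) -> bool:
    # Acceptability check: squaring the result should give back the input.
    return r >= 0.0 and abs(r * r - x) <= 1e-9 * max(1.0, x)

root = recovery_block(9.0, [primary, alternative], acceptable)
```

Unlike N-version programming, only one version runs when the primary is accepted, so the quality of the result depends on how good the acceptability check is.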

• Once a fault has occurred and been detected, the system must recover from it.

• Recovery can be either
– Forward
– Backward

• Backward recovery is exemplified by recovery blocks where, in case of a fault, the executive software reinitialises the program using the same input values as used in the previous cycle and attempts to execute the program again.

• Forward recovery is demonstrated in N-version programming, where the outputs are compared and erroneous values, generated by faulty software, are ignored and only the correct value is passed to the user.

Run Time Assertions

• Watchdog timers check the time taken for a block of code to execute, and if the code is not completed within the prescribed time, it is assumed that an error has occurred.

• Straightforward to implement
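
A software-only sketch of the timing check follows. It measures the elapsed time after the task returns, so unlike a hardware watchdog it cannot interrupt code that never finishes; the function name and the time budget are hypothetical.

```python
import time
from typing import Any, Callable, Tuple

def run_with_watchdog(task: Callable[[], Any], budget_s: float) -> Tuple[Any, bool]:
    """Run task and report whether it completed within budget_s seconds."""
    start = time.monotonic()  # monotonic clock: unaffected by wall-clock changes
    result = task()
    elapsed = time.monotonic() - start
    if elapsed > budget_s:
        return None, False  # deadline missed: assume an error has occurred
    return result, True
```

A call such as `run_with_watchdog(lambda: 42, budget_s=1.0)` returns the task's result together with a flag saying whether the deadline was met.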

Analytical Redundancy

• An emerging concept that combines data from the remaining functioning sensors with data from other sources in the system, using algorithms that compute the most probable value for the failed sensors.
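
As a toy illustration of the idea, suppose a vertical-speed sensor has failed while the altitude source still works. The sketch below estimates the missing rate as the least-squares slope of recent altitude samples. The scenario, names and sample values are hypothetical, and a real system would more likely rely on a model-based estimator than on a simple line fit.

```python
from typing import Sequence

def estimate_rate(samples: Sequence[float], dt: float) -> float:
    """Least-squares slope of equally spaced samples, in units per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    times = [i * dt for i in range(n)]
    t_mean = sum(times) / n
    y_mean = sum(samples) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(times, samples))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den

# Hypothetical altitude readings (m), taken every 0.5 s from a healthy source.
altitudes = [1000.0, 1002.0, 1004.0, 1006.0]
vertical_speed = estimate_rate(altitudes, dt=0.5)  # stands in for the failed sensor
```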
