fault tolearant system

FAULT TOLEARANT SYSTEM

A fault tolerant system is a system which is a able to continue operating despite the failure of a limited subset of their hardware or software.

They are gracefully degradable i.e. as the size of the faulty set increases, the system wont collapse suddenly but continue executing, part of its workload.

The goal of this design is to ensure that the probability of system failure is acceptably small.

FAULT TYPES

Hardware Fault: A hardware fault is some physical defect that can cause a component to malfunction. E.g. A broken wire or the output of a logic gate that is perpetually stuck at some logic value(0 or 1).

Software Fault: A software fault is bug that can cause the program to fail for a given set of inputs.

ERROR Error is a manifestation of a fault. e.g. A broken wire will cause an error if the system tries to propagate a signal through it.A program that has a fault that induces incorrect output for some set of inputs will generate errors, if that set of inputs is applied.

FAULT LATENCYThe fault latency is the duration between the onset of a fault and its manifestation as an error.

Since the faults themselves are invisible to the outside world, only showing themselves when they cause errors. Such latency will impact the reliability of the overall system.

ERROR RECOVERY It is the process by which the system attempts to recover from the effects of an error.

TYPES OF ERROR RECOVERYForward Error Recovery: In this type the error is masked without any computations having to be redone.Backward Error Recovery: In this type the system is rolled back to moment in the time before the error is believed to be occurred and computation is carried out again. It consumes additional time to mask the effects of failure.

CAUSES FOR FAULTS

Errors in the specification or design.

Defects in the components

Environmental effects.

Errors In The Specification Or Design

This error arises due to the communication gap between the person who writes the specification and the system designer.

The specification is the link between design process and real world application.

If specification is wrong everything that proceeds from it is likely to be wrong.

Defects In Components This fault arise due to defects caused by the wear and tear of use.

E.g. A mosfet may fail due to electro migration, which is the drifting away overtime of metal atoms towards the cathode.

Environmental Effects

This fault arise due to operating environment .

Devices can be subjected to whole array of stresses, depending on the application.

Poor ventilation or excessively high ambient temperatures can melt components or damage them.

e.g If a computer is in missile, it can undergo high g-forces and vibrational stress.

FAULT TYPESFaults are classified according to their temporal behavior and output behavior.

A fault is said to be active when it is physically capable of generating errors and to be benign when it is not.

TEMPORAL BEHAVIOR CLASSIFICATION

Fault types: Permanent, intermittent, transient.A permanent fault does not die away with time, but remains until it is repaired or the affected unit is replaced.

An intermittent fault cycles between the fault-active and fault benign states.

A transient fault dies away after some time.

Intermittent faults can be caused by loosely connected components.

Transient faults can be caused by environmental effects.

e.g. If there is a burst of electromagnetic radiation and the memory is not properly shielded, the contents of the memory can be altered without the memory chips themselves suffering any structural damage. When the memory is rewritten, the fault will go away.

OUTPUT BEHAVIOR CLASSIFICATION Malicious faults

• Inconsistent output, harder to neutralize these errors

• It behaves arbitrarily Non malicious faults

• Consistent output errors

• Easier to neutralize these errors

Fail stopResponds to up to a certain maximum number of failures by simply stopping, rather than putting out incorrect outputs.

Fail safeIts failure mode is biased so that the application process does not suffer catastrophe upon failure.

INDEPENDENCE AND CORRELATION Component failures may be independent or correlated.

Independent:A failure is said to be independent if it does not directly or indirectly cause another failure.

Correlated:If the failure is said to be correlated if they are related in some way. e.g. They may be triggered by same cause or one of them might cause the others to occur.

FAULT DETECTION There two ways to determine that a processor is malfunctioning • Online• Offline

Online Detection:

•This detection goes in parallel with normal system operation•It is done by checking the behavior that is inconsistent with correct operation.• Indication for faulty processor -Branching to an invalid destination. -Fetching an opcode from a location, which is not containing data.

- Writing into a portion of memory to which the process has no write access.

- Fetching an illegal opcode.- Inactive for more than a prescribed period.

• A monitor is associated with each processor, looking for signs that the processor is faulty. The monitor watches the data and address lines.

• Another approach is to have multiple processors, which are supposed to put out the same result , and compare the results.If a discrepancy arise it indicates an fault.

OFFLINE DETECTION

It is done by running a diagnostic test.

These test are scheduled just like ordinary task.

FAULT AND ERROR CONTAINMENT

The process of preventing the error spreading from one part to another part of the system is called containment

When a fault or error occurs in one part of a system, it will spread through the system like an infectious disease. e.g. An fault in one part of the system might cause large voltage swings in another. A fault-free processor can give erroneous results, when getting input from a faulty unit.

FAULT CONTAINMENT IS ACCOMPLISHED BY

The system is divided into fault and error containment zones(FCZ,ECZ).

An FCZ is a subset of the system that operates correctly despite arbitrary logical or electrical faults outside the subset. i.e. the failure of some part of the computer outside an FCZ cannot cause any element inside the FCZ to fail.

Hardware inside an fcz must be isolated from hardware outside it.It should withstand either a short-circuit or the aplication of the maximum voltage imposed on the lines connecting on FCZ to the outside world.

Each fcz should have an independent power supply and its own clocks. These clocks are synchronized with the clocks in other FCZ’s ,but a malfunction in the outside clocks wont affect the clocks inside the fcz.

The function of an ECZ is to prevent errors from propagating across zone boundaries. This is achieved by voting redundant outputs.

REDUNDANCY FTS consist of properly managed redundancy, i.e. the system is to kept running despite the failure of some its parts.

It must have spare capacity to begin with.

TYPES OF REDUNDANCY• Hardware redundancy• Software redundancy• Time redundancy• Information redundancy

Hardware redundancy Hardware redundancy is the use of additional hardware to compensate for failures. This can be accomplished in two ways.

•One of them is fault detection, correction, and masking.

Fault detection: Multiple hardware units may be assigned to do the same task in parallel and their results are compared. If one are more units are faulty, we can expect this to show up as a disagreement in the result.

Fault Masking: If minority of the units are faulty and a majority of the units produce the same output, the majority result can considered and failure effect is masked.

Fault correction: If minority of the units disagree, the fault is detected. So the computation is repeated on other processors to correct that fault.

• The second one in hardware redundancy is replacing the malfunctioning unit .It is possible that the system can be designed so that faulty units can be easily replaced with spare ones.

Two methods used in hardware redundancy

•Static Pairing

•N modular Redundancy (NMR)

STATIC PAIRING

•Hardwire processors in pairs and to discard the entire pair if one of the processors fails, this is very simple scheme

•The Pairs runs identical software with identical inputs and should generate identical outputs. If the output is not identical, then the pair is non functional, so the entire pair is discarded

•This approach is depicted in the following figure, and it will work only when the interface is working fine and both the processors do not fail identically and around the same time

• The interface is monitored by means of a monitor. If the interface fails, the monitor takes care and if the monitor fails, the interface takes care. If both interface and monitor fails, then the system is down.

N MODULAR REDUNDANCY

•It is a scheme for Forward Error Recovery.

•It works with N processors instead of one and voting on their output and N is usually odd.

•NMR can be illustrated by means of the following two ways

There are N voters and the entire cluster produces N outputs

There is just one voter

• NMR clusters are designed to allow the purging of malfunctioning units. That is, when a failure is detected, the failed unit is checked to see whether or not the failure is transient. If it is not, it must be electrically isolated from the rest of the cluster and a replacement unit is switched on. The faster the unit is replaced, the more reliable the cluster.

• Purging can be done either by hardware or by the operating system.

• Self purging consists of a monitor at each unit comparing its output against the voted output. If there is a difference, the monitor disconnects the unit from the system.

• The monitor can be described as a finite state machine with two states connect and isolate. There are two signals, diff which is set to 1 whenever the module output disagrees with the voter output and reconnect, which is a command from the system to reconnect the module

SOFT WARE REDUNDANCY•Software faults are not like hardware faults i.e. software never wears out , the faults are not generated spontaneously during system operation.

•Software faults can be regarded as faults in design.

•For software redundancy simply replicating the same software N times will not work, all N copies will fail for the same inputs.

•Instead N versions of the software can be implemented. The N versions can be developed by independent teams, with no contact between them.

• Each version is being developed by a team of developers who never communicated with each other

• To minimize the common mode failures

The specifications should be written in formal terms and are subject to rigorous process of checking

Multiple software versions should be developed in different programming languages.

Nature of tools that are being used should be selected properly.

Training and quality of the programmers should be maintainded.

There are two approaches for that

•N Version Programming

•Recovery Block Approach

N Version Programming

Recovery Block Approach

THANK U

fault tolearant system

Technology