1 Chapter Fault Tolerant Design of Digital Systems

Post on 20-Dec-2015



4.1 The Importance of Fault Tolerance

• Fault Tolerant design can provide dramatic improvements in system availability and lead to a substantial reduction in maintenance costs as a consequence of fewer system failures.

• Two different approaches to increase the reliability:

1. Fault prevention

2. Fault tolerance


4.2 Basic Concepts of Fault Tolerance

• Fault-tolerant system: a system that has the built-in capability (without external assistance) to preserve the continued correct execution of its programs and input/output functions in the presence of a certain set of operational faults.

• Types of faults:

a) anticipated faults

b) unanticipated faults


4.3 Static Redundancy

• Also known as “masking redundancy”

• Two major techniques employed:

1. Triple modular redundancy

2. Use of error correcting codes

4.3.1 Triple Modular Redundancy (TMR)

• Could be expanded to NMR (N-modular redundancy)

• An NMR system can tolerate up to n module failures, where n = (N-1)/2

• In general, in an NMR system N is an odd number.


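• The reliability equation of an NMR system is:

$$R_{NMR} = \sum_{i=0}^{n} \binom{N}{i} (1 - R_M)^{i} R_M^{N-i}$$

where $R_M$ is the reliability of each module.

• For the TMR case N=3 and n=1:

$$\begin{aligned}
R_{TMR} &= \sum_{i=0}^{1} \binom{3}{i} (1 - R_M)^{i} R_M^{3-i} \\
        &= \binom{3}{0} (1 - R_M)^{0} R_M^{3} + \binom{3}{1} (1 - R_M)^{1} R_M^{2} \\
        &= R_M^{3} + 3R_M^{2}(1 - R_M) \\
        &= R_M^{3} + 3R_M^{2} - 3R_M^{3} \\
        &= 3R_M^{2} - 2R_M^{3}
\end{aligned}$$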
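The NMR reliability sum, with n = (N-1)/2 maskable failures, can be checked numerically. The following is a minimal Python sketch, not part of the original slides: the function name `nmr_reliability` is an illustrative choice, and it assumes independent modules of equal reliability and a perfect voter.

```python
from math import comb


def nmr_reliability(r_m: float, n_modules: int = 3) -> float:
    """Reliability of an N-modular-redundant system with a perfect voter.

    Sums the probability of exactly i failed modules for i = 0..n,
    where n = (N - 1) / 2 is the number of failures that can be masked.
    Assumes independent modules that all have reliability r_m.
    """
    if n_modules < 1 or n_modules % 2 == 0:
        raise ValueError("N must be a positive odd number")
    n = (n_modules - 1) // 2
    return sum(
        comb(n_modules, i) * (1 - r_m) ** i * r_m ** (n_modules - i)
        for i in range(n + 1)
    )


# For N = 3 the sum should agree with the closed form 3*R^2 - 2*R^3.
r = 0.9
print(nmr_reliability(r, 3), 3 * r**2 - 2 * r**3)
```

With N = 1 the function reduces to the reliability of a single module, which is a quick sanity check on the summation limits.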


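• Note:

$$\binom{n}{r} = \frac{n!}{r!\,(n-r)!}, \qquad \binom{n}{1} = n, \qquad \binom{n}{0} = 1$$

• Another way to calculate $R_{TMR}$:

$$\begin{aligned}
R_{TMR} &= \text{Probability of all three modules functioning} + \text{Probability of any two modules functioning} \\
        &= R_M^{3} + 3R_M^{2}(1 - R_M) \\
        &= 3R_M^{2} - 2R_M^{3}
\end{aligned}$$

Exercise: Evaluate $R_{TMR}$ if $R_M$ = 0.6, 0.5 and 0.4.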
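One way to work the exercise is to evaluate the closed form 3R_M^2 - 2R_M^3 directly. This is a small Python sketch; the helper name `tmr_reliability` is hypothetical, and a perfect voter is assumed.

```python
def tmr_reliability(r_m: float) -> float:
    """Closed-form TMR reliability, 3*R^2 - 2*R^3, assuming a perfect voter."""
    return 3 * r_m**2 - 2 * r_m**3


for r_m in (0.6, 0.5, 0.4):
    print(f"R_M = {r_m:.1f}  ->  R_TMR = {tmr_reliability(r_m):.3f}")
# R_M = 0.6  ->  R_TMR = 0.648
# R_M = 0.5  ->  R_TMR = 0.500
# R_M = 0.4  ->  R_TMR = 0.352
```

The three values show that TMR improves on a single module only when R_M is above 0.5; at R_M = 0.5 it breaks even, and below that it is worse.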


Reliability & MTBF & Failure rate

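For a constant failure rate $\lambda$,

$$R_M = e^{-\lambda t}$$

$$MTBF_M = \int_0^{\infty} R_M \, dt = \int_0^{\infty} e^{-\lambda t} \, dt = \left[ -\frac{1}{\lambda} e^{-\lambda t} \right]_0^{\infty} = -\frac{1}{\lambda}(0 - 1) = \frac{1}{\lambda}$$

Thus, for TMR, where

$$R_{TMR} = 3e^{-2\lambda t} - 2e^{-3\lambda t}$$

$$MTBF_{TMR} = \int_0^{\infty} \left( 3e^{-2\lambda t} - 2e^{-3\lambda t} \right) dt = \left[ -\frac{3}{2\lambda} e^{-2\lambda t} + \frac{2}{3\lambda} e^{-3\lambda t} \right]_0^{\infty} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{9 - 4}{6\lambda} = \frac{5}{6\lambda}$$

NOTE: $\frac{5}{6\lambda} < \frac{1}{\lambda}$, i.e. the MTBF of the TMR system is lower than that of a single module.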
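The two MTBF results, 1/λ for a single module and 5/(6λ) for TMR, can be cross-checked by integrating the reliability functions numerically. The sketch below uses a plain trapezoidal rule from the standard library only; the failure rate `lam = 0.001` per hour and the finite integration horizon are illustrative assumptions, not values from the slides.

```python
import math


def r_simplex(t: float, lam: float) -> float:
    """Reliability of one module with constant failure rate lam."""
    return math.exp(-lam * t)


def r_tmr(t: float, lam: float) -> float:
    """TMR reliability with a perfect voter: 3e^(-2*lam*t) - 2e^(-3*lam*t)."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)


def integrate(f, upper: float, steps: int = 200_000) -> float:
    """Trapezoidal approximation of the integral of f over [0, upper]."""
    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    for k in range(1, steps):
        total += f(k * h)
    return total * h


lam = 0.001            # assumed failure rate (per hour)
horizon = 40 / lam     # the tail beyond this point is negligible

mtbf_simplex = integrate(lambda t: r_simplex(t, lam), horizon)
mtbf_tmr = integrate(lambda t: r_tmr(t, lam), horizon)

print(mtbf_simplex, 1 / lam)        # both close to 1000
print(mtbf_tmr, 5 / (6 * lam))      # both close to 833.3
```

Although the TMR system has the lower MTBF, its reliability is higher than the single module's for short missions (as long as R_M stays above 0.5, i.e. t < ln 2 / λ).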


• We should look for a more useful parameter than MTBF.

Other parameters for evaluating system reliability

• Reliability Improvement Factor (RIF):

$$RIF = \frac{1 - R_N}{1 - R_R}$$

where $1 - R_N$ is the probability of failure of the non-redundant system and $1 - R_R$ is the probability of failure of the redundant system.

• Mission Time Improvement Factor (MTIF):

$$MTIF = \frac{T_R}{T_N} \quad \text{at } R_f$$

where $R_f$ is some predetermined reliability (e.g. 0.99 or 0.90), while $T_R$ and $T_N$ are the times at which the system reliabilities $R_R(t)$ and $R_N(t)$, respectively, fall to the value $R_f$.
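As an illustration of the two factors, the sketch below evaluates RIF and MTIF for a TMR system against a single module. It assumes a perfect voter and a constant failure rate, so that R_N(t) = e^(-λt) and R_R(t) = 3R_N^2 - 2R_N^3; the function names and the bisection solver are illustrative choices, not taken from the slides.

```python
import math


def tmr_reliability(r_m: float) -> float:
    """TMR reliability for module reliability r_m (perfect voter assumed)."""
    return 3 * r_m**2 - 2 * r_m**3


def rif(r_nonredundant: float, r_redundant: float) -> float:
    """Reliability Improvement Factor: (1 - R_N) / (1 - R_R)."""
    return (1 - r_nonredundant) / (1 - r_redundant)


def mtif_tmr(r_f: float, tol: float = 1e-12) -> float:
    """MTIF = T_R / T_N for TMR versus one module, for 0 < r_f < 1.

    Finds by bisection the module reliability x at which the TMR
    reliability equals r_f (3x^2 - 2x^3 is increasing on [0, 1]).
    With R_M = exp(-lam * t): T_R = -ln(x) / lam and T_N = -ln(r_f) / lam,
    so the failure rate cancels in the ratio.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if tmr_reliability(mid) < r_f:
            lo = mid
        else:
            hi = mid
    x = (lo + hi) / 2
    return math.log(x) / math.log(r_f)


print(rif(0.9, tmr_reliability(0.9)))   # RIF when the module reliability is 0.9
print(mtif_tmr(0.99))                   # MTIF for R_f = 0.99
print(mtif_tmr(0.90))                   # MTIF for R_f = 0.90
```

Under these assumptions the MTIF at R_f = 0.99 comes out at roughly 6, which is why mission-time measures show the benefit of TMR that the MTBF hides.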


The reliability of the voter element

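• If the voter has the reliability $e^{-\lambda_v t}$, then the reliability of the TMR becomes:

$$R_{TMR} = e^{-\lambda_v t} \left( 3e^{-2\lambda t} - 2e^{-3\lambda t} \right)$$

• If $\lambda_v \ge \lambda$, the reliability of the system is less than that of the original (non-redundant) system for all t > 0. Thus, we have to improve the reliability of the voter.

• Treating each module together with its voter as one unit of reliability $R_M R_V$, the system reliability is:

$$R_{sys} = (R_M R_V)^{3} + 3(R_M R_V)^{2}(1 - R_M R_V)$$

where $R_V$ is the reliability of the voter.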
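The effect of an imperfect voter can be seen by comparing a single module with a TMR system whose single voter has its own failure rate. This is a hedged sketch: the function names, the failure rates and the mission times are all illustrative assumptions.

```python
import math


def simplex(t: float, lam: float) -> float:
    """Reliability of one non-redundant module."""
    return math.exp(-lam * t)


def tmr_with_voter(t: float, lam: float, lam_v: float) -> float:
    """TMR reliability with one voter of failure rate lam_v in series."""
    r_m = math.exp(-lam * t)
    r_v = math.exp(-lam_v * t)
    return r_v * (3 * r_m**2 - 2 * r_m**3)


lam = 1e-3  # assumed module failure rate (per hour)
for t in (10, 100, 1000, 5000):
    same = tmr_with_voter(t, lam, lam_v=lam)          # voter as unreliable as a module
    better = tmr_with_voter(t, lam, lam_v=lam / 100)  # much more reliable voter
    print(f"t={t:5d}  simplex={simplex(t, lam):.4f}  "
          f"TMR(lam_v=lam)={same:.4f}  TMR(lam_v=lam/100)={better:.4f}")
```

With the voter as unreliable as a module, the TMR column stays below the simplex column at every mission time; with a far more reliable voter, TMR is ahead for the short missions.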


The major advantages of the TMR scheme

Major advantages of TMR are:

1. The fault-masking action occurs immediately; both temporary and permanent faults are masked.

2. No separate fault detection is necessary before masking.

3. The conversion from a non-redundant system to a TMR system is straightforward.
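The masking behaviour behind these advantages can be illustrated with a bitwise majority voter. This is a minimal sketch in Python rather than hardware; the function name `majority` and the bit patterns are made up for illustration.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority vote over three module outputs."""
    return (a & b) | (b & c) | (a & c)


correct = 0b1011_0010
faulty = correct ^ 0b0000_1000   # one module output with a flipped bit

# A single faulty module is masked at once, with no separate detection step.
print(majority(correct, correct, faulty) == correct)   # True

# Two modules failing in the same way are not masked.
print(majority(correct, faulty, faulty) == correct)    # False
```

The voter never needs to know which module failed, which is why no fault detection is required before masking.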


4.4 Dynamic redundancy

• A system with dynamic redundancy consists of several modules but with only one operating at a time.

• If a fault is detected in the operating module it is switched out and replaced by a spare.

• It requires consecutive actions of fault detection and fault recovery.

• A dynamic redundant system with S spares has a reliability:

$$R = 1 - (1 - R_m)^{(S+1)}$$

where $R_m$ is the reliability of each module, active or spare, in the system. This reliability function is obtained assuming that the fault detection and the switchover mechanism are perfect.
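The formula R = 1 - (1 - R_m)^(S+1) is straightforward to tabulate. The sketch below assumes, as the formula does, perfect fault detection and switchover; the function name and the module reliability 0.9 are illustrative.

```python
def standby_reliability(r_m: float, spares: int) -> float:
    """Reliability of a dynamic redundant system with `spares` spare modules.

    The system fails only if the active module and all spares fail,
    assuming perfect fault detection and switchover.
    """
    return 1 - (1 - r_m) ** (spares + 1)


for s in range(4):
    print(f"S = {s}:  R = {standby_reliability(0.9, s):.4f}")
```

Each extra spare multiplies the failure probability by (1 - R_m), so under these ideal assumptions the reliability keeps rising with S.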


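• With perfect fault detection and switchover, the reliability R is an increasing function of the number of spare modules.

• However, when the detection and switchover mechanisms are not perfect, the use of too many spares may have a detrimental effect on the system reliability.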

• Losq has shown that for every dynamic redundant system there exists a finite best number of spares for a given mission time:

• When the mission time is extremely short, one spare is best.

• When the mission time is less than one-tenth of the simplex (i.e. non-redundant) mean life, five or fewer spares is best.


• The detection of a fault in the individual modules of a dynamic system can be achieved by using one of the following techniques:

1. Periodic tests:

• Offline.

• Disadvantage: cannot detect temporary faults unless they occur while the module is tested.

2. Self-checking circuits: provide a very cost-effective method of fault detection.

3. Watchdog timers: timer, checkpoints.
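The watchdog-timer idea (a timer that the monitored module must reset at its checkpoints) can be sketched in software as follows. The class name `WatchdogTimer`, its methods and the injectable clock are hypothetical choices made for this illustration; a hardware watchdog would be a counter that triggers recovery on overflow.

```python
import time


class WatchdogTimer:
    """Flags a fault if the monitored module stops reaching its checkpoints."""

    def __init__(self, timeout_s: float, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock            # injectable so the demo needs no sleeping
        self._last_kick = clock()

    def kick(self) -> None:
        """Called by the module at each checkpoint to reset the timer."""
        self._last_kick = self._clock()

    def expired(self) -> bool:
        """True if no checkpoint was reached within the timeout."""
        return self._clock() - self._last_kick > self.timeout_s


# Demo with a fake clock (a one-element list holding the current time).
now = [0.0]
wd = WatchdogTimer(timeout_s=1.0, clock=lambda: now[0])

now[0] = 0.5
wd.kick()              # module reaches a checkpoint in time
now[0] = 1.2
print(wd.expired())    # False: only 0.7 s since the last checkpoint
now[0] = 3.0
print(wd.expired())    # True: the module missed its checkpoint
```

An expired watchdog says only that the module stopped making progress; deciding between retry and reconfiguration is a separate step.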

• Reconfiguration: switching out the faulty element and selecting the system output to come from one of the alternative modules.

• Retry: so that a module will not be removed because of a temporary fault.

• Self-repair: the replacement is invisible to the user and the system continues its operation uninterrupted.

• In general, dynamic redundant systems can be divided into two categories:

a) Cold-standby system.

b) Hot-standby system.
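The interplay of retry and reconfiguration described in the bullets above can be sketched in a few lines. Everything here is hypothetical illustration: modules are modelled as Python callables, a detected fault is modelled as a `ModuleFault` exception, and `run_with_sparing` is a made-up name.

```python
class ModuleFault(Exception):
    """Raised when a module's fault detection fires (hypothetical model)."""


def run_with_sparing(modules, x, retries=1):
    """Run x on the active module; retry on a fault, then switch to a spare.

    Returns (index of the module that produced the result, result).
    """
    for index, module in enumerate(modules):
        for _attempt in range(retries + 1):
            try:
                return index, module(x)
            except ModuleFault:
                continue    # retry: the fault may have been temporary
        # Retries exhausted: treat the fault as permanent and reconfigure.
    raise RuntimeError("all modules, including spares, have failed")


def make_transient(fail_times):
    """Module that fails `fail_times` times and then works correctly."""
    state = {"left": fail_times}

    def module(x):
        if state["left"] > 0:
            state["left"] -= 1
            raise ModuleFault("temporary fault")
        return x * 2

    return module


def dead(x):
    raise ModuleFault("permanent fault")


def healthy(x):
    return x * 2


print(run_with_sparing([make_transient(1), healthy], 21))  # (0, 42): retry succeeded
print(run_with_sparing([dead, healthy], 21))               # (1, 42): switched to spare
```

In the first call the temporary fault is absorbed by the retry, so the active module is kept; in the second the permanently failed module is switched out and the spare supplies the output.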