treatment of general dependencies in system fault-tree and risk analysis



278 IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 3, SEPTEMBER 2002

Treatment of General Dependencies in System Fault-Tree and Risk Analysis

Jussi K. Vaurio

Abstract—Implicit and explicit methods are described for reliability and risk analysis of systems with dependent or correlated basic events. General rules are presented for modeling any group of $n$ mutually s-dependent events with $2^n - 1$ s-independent events. The probabilities of these virtual events are determined based on the joint probabilities of the original s-dependent events, typically known by s-correlation or conditional probabilities. The transformations preserve the values of all terms (e.g., minimal cut sets), independent of system success criteria. This facilitates general use of ordinary fault-tree computer codes that assume basic events to be s-independent. Explicit basic event probabilities are obtained for calculating the probability of failure on demand of standby safety systems when the s-dependency is caused by scheduling and synchronization of test episodes between redundant components ($1 < n \le 4$), and by statistical variation of failure rates. Interesting “negative probabilities” are encountered in this exercise, mainly due to negative s-correlation between the component unavailabilities with staggered testing. Results obtained for human-error events are useful when the conditional probability to repeat an error is larger than the probability of an error in a single isolated task. Explicit results are obtained for systems with time-related common-cause failures modeled by general multiple failure rates. The impacts of test intervals and test staggering are included. Staggered testing is optimal with an ETR (extra-testing rule), although ETR is not important for 1-out-of-$n$:G systems. An economic model provides insights into the impacts of various parameters: the optimal test interval increases with increasing redundancy and testing cost, and it decreases with increasing accident cost and initiating event rate. Staggered testing with ETR allows for the longest optimal test intervals. Rules are presented for changing s-dependency probabilities when some component is known to be failed.

Current fault-tree quantification tools are not well geared to use the implicit method, even though it would simplify the fault-tree construction, reduce the number of cut sets, and allow different types of dependencies or correlations in the analysis. A recommendation is to computerize the implicit method or include it as an option in current codes. It would need only a data table for joint probabilities and the ability to pick up data from this table whenever two or more of the s-dependent events appear in a term (or a cut set).

Index Terms—Common-cause failure, dependent failure, fault tree, implicit/explicit method, standby/safety system.

ACRONYMS AND ABBREVIATIONS¹

CCF     common-cause failure: simultaneous same-cause failure of multiple components
CD      complete s-dependency
ETR     extra testing rule: a way to identify and repair other components of a group when one component is found failed
GMFR    general multiple failure rate
HD      high s-dependency
LD      low s-dependency
MD      medium s-dependency
PSA     probabilistic safety assessment
r.v.    random variable
s-      implies: statistical(ly)
SFRM    system failure rate model
ZD      zero s-dependency

Manuscript received January 18, 2000; revised February 12, 2001. Responsible Editor: W. H. Sanders.

The author is with Fortum Power and Heat Oy, 07901 Loviisa, Finland (e-mail: [email protected]).

Publisher Item Identifier 10.1109/TR.2002.801848.

¹The singular and plural of an acronym are always spelled the same.

NOTATION

cost of an accident
cost of repairing 1 train (out of $n$)
cost of testing 1 train (out of $n$)
average cost rate as a function of $T$
frequency of initiating events: demands for a standby system
$k$-out-of-$n$ system
average unavailability of a standby $k$-out-of-$n$ system
$\cup$  union of sets: OR-gate in a fault tree
$\cap$  intersection of sets: AND-gate in a fault tree
unavailability of specific basic events due to a common cause that influences exactly these events and no others
$X_i$  Boolean variable: a certain failure mode state of a specific component $i$ due to any cause
$Z_{ij\ldots}$  Boolean variable: failed state of events $X_i, X_j, \ldots$ due to a common cause; s-independent of other events $Z$
negation of […]
rate of entering […]
single-failure rate, same for $n$ s-identical components (Section III)
standard deviation of the single-failure rate
failure-rate of specific components due to a common cause in a group of $n$ s-identical components
a distributed r.v. with […] degrees of freedom, divided by […]
$T$  test interval

0018-9529/02$17.00 © 2002 IEEE


time from a test of a component to the next test
$n$  number of s-identical components subject to similar failures
component unavailability, tested in sequence
human error in the first task of consecutive similar tasks (Section IV)
repeating an error for the $k$th time, given that it has just occurred consecutively $k - 1$ times (Section IV)
a human error in a task, given that no error was made in the preceding task
error dependency parameter in Handbook formalism.

I. INTRODUCTION

FAULT tree analysis [1] is a common technique used in large-system reliability analysis and in PSA [2]. The qualitative phase of the analysis solves the system logic function in terms of the minimal cut sets, as a Boolean sum of products of basic events (component failure modes). The quantitative phase calculates the system failed-state probability. Most computer codes assume the basic events to be mutually s-independent. General s-dependencies are not routinely considered.

There are basically two ways to incorporate s-dependent failures in system analysis: “implicit” and “explicit” methods. In multiversion software studies these correspond to “s-correlated failures” and “differentiated causes” approaches, respectively [16]. In the implicit method the minimal cut sets and the probability equation are first presented as if all events $X_i$ ($i$ refers to component) were mutually s-independent basic events. In each term any product $P(X_i)P(X_j)$ is then replaced with $P(X_i \cap X_j)$, $P(X_i)P(X_j)P(X_k)$ with $P(X_i \cap X_j \cap X_k)$, etc., as s-dependency in general means that the joint probabilities are not equal to the products of individual basic-event probabilities. This approach applies if the joint probabilities $P(X_i \cap X_j \cap \cdots)$ are known or can be determined through s-correlations or conditional probabilities. Examples of this kind can be found in calculating the time-average unavailability of a standby safety system, or when repeatable human errors are included in a system analysis.
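The table lookup that the implicit method relies on can be sketched in a few lines. This is only an illustration under stated assumptions: the event names, the probability values, and the helper names `cut_set_probability` and `system_probability` are hypothetical, a single group of s-dependent events is assumed, and the system probability uses the rare-event approximation (sum of minimal cut set probabilities), which is a choice made for this sketch.

```python
from math import prod

# Hypothetical data: X1 and X2 form one s-dependent group, A is s-independent.
JOINT = {
    frozenset({"X1"}): 0.10,
    frozenset({"X2"}): 0.10,
    frozenset({"X1", "X2"}): 0.03,  # differs from 0.10 * 0.10: the events are dependent
}
INDEPENDENT = {"A": 0.01}


def cut_set_probability(cut_set, joint, independent):
    """Probability of one minimal cut set under the implicit method."""
    dependent = frozenset(e for e in cut_set if e not in independent)
    # One table lookup replaces the product of the dependent events' probabilities.
    p_dependent = joint[dependent] if dependent else 1.0
    return p_dependent * prod(independent[e] for e in cut_set if e in independent)


def system_probability(cut_sets, joint, independent):
    """Rare-event approximation: sum of the minimal cut set probabilities."""
    return sum(cut_set_probability(cs, joint, independent) for cs in cut_sets)


cut_sets = [{"X1", "X2"}, {"X1", "A"}]
print(system_probability(cut_sets, JOINT, INDEPENDENT))
```

With several s-dependent groups, the same lookup would be repeated once per group inside each cut set.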

Certain types of s-dependencies can be more easily modeled in a fault tree explicitly as mutually s-independent basic events $Z_{ij\ldots}$, influencing specific components $i, j, \ldots$ (and no others). In this case each component-level event $X_i$ is replaced with the union (Boolean OR-gate) of all events $Z$ which fail component $i$ (indicated by $i$ as any 1 of the subindexes of $Z$). This is illustrated for one component-level event in Figs. 1 and 2 for 3 s-dependent failures. Other component-level events are replaced similarly. The explicit method has been used mainly for CCF with empirically or parametrically determined probabilities $P(Z_{ij\ldots})$ [3]. An advantage of the explicit method is that most computer codes solving fault trees can automatically handle mutually s-independent CCF-events.
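The OR-gate substitution described above amounts to enumerating every index combination that contains the component in question. A minimal sketch, with hypothetical helper names `cause_events` and `all_cause_events`, representing each cause-event by the tuple of component indexes it fails:

```python
from itertools import combinations


def cause_events(component, n):
    """Index tuples of all Z events that feed the OR-gate replacing X_component."""
    others = [j for j in range(1, n + 1) if j != component]
    events = []
    for r in range(len(others) + 1):
        for combo in combinations(others, r):
            events.append(tuple(sorted((component,) + combo)))
    return events


def all_cause_events(n):
    """All 2**n - 1 cause-events of a group of n components."""
    return [c for r in range(1, n + 1) for c in combinations(range(1, n + 1), r)]


print(cause_events(1, 3))  # [(1,), (1, 2), (1, 3), (1, 2, 3)]
```

Each component is fed by $2^{n-1}$ cause-events, and the group as a whole needs $2^n - 1$ of them.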

Solving large fault-tree problems with the implicit method, as well as the benefits and limitations of the method, have been studied [4]. The number of events and minimal cut-sets is smaller than in the explicit method. Potential drawbacks of the implicit method are:

1) A large number of terms (minimal cut sets) might need to be manually quantified (unless the process is computerized), and

2) Truncation of the system equation (necessary in large systems) carries the risk of losing terms or events that could be important through s-dependencies.

Fig. 1. Component-level fault-tree (example).

Fig. 2. Component event X modeled by cause-events Z.

For these 2 reasons, and because most fault-tree codes can automatically handle s-independent events (explicit models), it would be beneficial to be able to change an implicit model to an equivalent explicit model and know how the probabilities $P(Z_{ij\ldots})$ should be determined when the joint probabilities $P(X_i \cap X_j \cap \cdots)$ are known. This paper shows that this kind of transformation is possible when:

The system can be modeled as a fault-tree with basic events, AND-gates, and OR-gates;

The s-dependencies can be presented as joint-probabilities of basic events being different from the products of individual event probabilities.

Other techniques might be needed for other kinds of s-dependencies, such as limited repair resources or coverage failures [17].

Section II develops exact relationships for groups of $n = 2$, 3, 4 s-dependent events. Section III develops application equations for standby safety-system analysis. Section IV presents comprehensive equations and some numerical tables for groups of s-dependent (repeatable) human errors. Section V analyzes systems with CCF. Section VI develops rules for analyzing systems in special on-line situations with known failed states.

II. GENERAL IMPLICIT-TO-EXPLICIT TRANSFORMATION

In the explicit method, each component-level event $X_i$ of a group of $n$ mutually s-dependent events in a fault-tree is replaced with an OR-gate having as input all mutually s-independent Boolean events $Z_i$, $Z_{ij}$, etc., so that in each $Z$, 1 sub-index equals $i$. In this way, $n$ mutually s-dependent events $X_1, \ldots, X_n$ are modeled with $2^n - 1$ mutually s-independent events, which can be real events or virtual surrogate events just for computation. The probabilities of these events are determined so that all joint probabilities $P(X_i \cap X_j)$, $P(X_i \cap X_j \cap X_k)$, etc., are preserved.

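For $n = 2$, one has to define 3 s-independent events: $Z_1$, $Z_2$, $Z_{12}$. The OR-gates $X_1$ and $X_2$ have the following inputs:

$Z_1$ is input to $X_1$,
$Z_2$ is input to $X_2$, and
$Z_{12}$ is input to both $X_1$ and $X_2$.

This is equivalent to representing $X_1$ and $X_2$ with the Boolean equations:

$$X_1 = Z_1 \cup Z_{12}, \qquad X_2 = Z_2 \cup Z_{12}.$$

From these follows: $X_1 \cap X_2 = Z_{12} \cup (Z_1 \cap Z_2)$. The conditions for solving the probabilities $P(Z_1)$, $P(Z_2)$, and $P(Z_{12})$ in terms of $P(X_1)$, $P(X_2)$, and $P(X_1 \cap X_2)$ are:

$$P(X_1) = P(Z_1) + P(Z_{12}) - P(Z_1)P(Z_{12}),$$
$$P(X_2) = P(Z_2) + P(Z_{12}) - P(Z_2)P(Z_{12}),$$
$$P(X_1 \cap X_2) = P(Z_{12}) + P(Z_1)P(Z_2) - P(Z_1)P(Z_2)P(Z_{12}).$$

In this case, an analytic solution is possible and leads to:

$$P(Z_1) = \frac{P(X_1) - P(X_1 \cap X_2)}{1 - P(X_2)}, \quad P(Z_2) = \frac{P(X_2) - P(X_1 \cap X_2)}{1 - P(X_1)}, \quad P(Z_{12}) = \frac{P(X_1 \cap X_2) - P(X_1)P(X_2)}{1 - P(X_1) - P(X_2) + P(X_1 \cap X_2)} \qquad (1)$$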
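A numerical round trip makes the $n = 2$ case concrete. The sketch below is not taken from the paper: the function names and the shorthand `x1`, `x2`, `x12` (joint probabilities of the original events) and `z1`, `z2`, `z12` (probabilities of the virtual s-independent events) are assumptions made for illustration, and the solution follows from $X_1 = Z_1 \cup Z_{12}$, $X_2 = Z_2 \cup Z_{12}$ with s-independent $Z$.

```python
def explicit_from_pair(x1, x2, x12):
    """Virtual-event probabilities (z1, z2, z12) that reproduce x1, x2 and x12."""
    z1 = (x1 - x12) / (1.0 - x2)
    z2 = (x2 - x12) / (1.0 - x1)
    z12 = (x12 - x1 * x2) / (1.0 - x1 - x2 + x12)
    return z1, z2, z12


def pair_from_explicit(z1, z2, z12):
    """Joint probabilities of X1 = Z1 or Z12 and X2 = Z2 or Z12, Z s-independent."""
    x1 = 1.0 - (1.0 - z1) * (1.0 - z12)
    x2 = 1.0 - (1.0 - z2) * (1.0 - z12)
    # X1 and X2 both occur iff Z12 occurs, or Z1 and Z2 both occur.
    x12 = 1.0 - (1.0 - z12) * (1.0 - z1 * z2)
    return x1, x2, x12


z = explicit_from_pair(0.1, 0.2, 0.05)
print(z)                       # z12 is positive because 0.05 > 0.1 * 0.2
print(pair_from_explicit(*z))  # should reproduce the inputs up to rounding
```

If the joint probability is smaller than the product of the single probabilities (negative s-correlation), `z12` comes out negative, which is the situation met later with staggered testing.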

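For $n = 3$, three single-failure events $Z_i$, three double-failure events $Z_{ij}$, and one triple-failure event $Z_{123}$ need to be defined and quantified. It is advantageous to work with the success states $\bar{Z}$, with probabilities $P(\bar{Z}) = 1 - P(Z)$. These are mutually s-independent just as the $Z$ are. The basic idea in writing the equations for the $P(\bar{Z})$ is that $\bar{X}_i$ is TRUE if and only if all $\bar{Z}$ (that have $i$ as any one of the sub-indexes) are TRUE. Similarly any intersection $\bar{X}_i \bar{X}_j = \bar{X}_i \cap \bar{X}_j$ is TRUE if and only if all $\bar{Z}_i$, $\bar{Z}_j$, $\bar{Z}_{ij}, \ldots$ (that have $i$ or $j$ or both as a sub-index) are TRUE. For $n = 3$ the 7 equations for solving 7 values $P(\bar{Z})$ in terms of the joint probabilities of the $X_i$ are:

On the left hand side the “de Morgan rule” was applied, $\bar{X}_i \cap \bar{X}_j = \overline{X_i \cup X_j}$, and the inclusion–exclusion principle (Poincaré's rule) to get the intersection probabilities in terms of $P(X_i \cap X_j \cap \cdots)$. The values of $P(\bar{Z})$ can now be easily solved by the ratios, e.g., $P(\bar{Z}_1) = P(\bar{X}_1 \cap \bar{X}_2 \cap \bar{X}_3) / P(\bar{X}_2 \cap \bar{X}_3)$.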

Let , then:

(2)

For $n = 4$ there are 15 probabilities $P(Z)$ to be solved. The same number of equations can be developed by first calculating (by the inclusion–exclusion principle) the success-state probabilities for all combinations of unequal indexes:

(3)

These same probabilities can be written as products of all $P(\bar{Z})$ that have any sub-index equal to any sub-index of the success-state intersection in question. This principle yields:


From these, solve the $P(\bar{Z})$ and then solve the $P(Z) = 1 - P(\bar{Z})$.

(4)

These are general equations to calculate exactly all probability values for the explicit model basic-events, whether real or virtual, in terms of the joint probabilities $P(X_i \cap X_j \cap \cdots)$.
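The success-state principle of this section can be written for any group size. The sketch below is an illustrative implementation, not the paper's equations (2) to (4): the function names are hypothetical, events are keyed by `frozenset` of component indexes, and the probability of each success state $\bar{Z}$ is isolated by alternating ratios over subsets, which is one way to carry out the "solve by ratios" step.

```python
from itertools import combinations


def subsets(items):
    """Yield every subset of items (including the empty set) as a frozenset."""
    items = tuple(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def explicit_to_implicit(z, n):
    """Joint failure probabilities P(all X_i, i in S) from s-independent P(Z_A)."""
    full = frozenset(range(1, n + 1))

    def none_failed(t):
        # All X_i with i in t are in the success state iff no Z_A touching t occurs.
        prob = 1.0
        for a, z_a in z.items():
            if a & t:
                prob *= 1.0 - z_a
        return prob

    # Inclusion-exclusion over the success states.
    return {
        s: sum((-1) ** len(t) * none_failed(t) for t in subsets(s))
        for s in subsets(full) if s
    }


def implicit_to_explicit(joint, n):
    """Probabilities P(Z_A) of the 2**n - 1 virtual events from the joint P(X...)."""
    full = frozenset(range(1, n + 1))
    p = dict(joint)
    p[frozenset()] = 1.0
    # q[S] = P(no component in S is failed), by inclusion-exclusion.
    q = {s: sum((-1) ** len(t) * p[t] for t in subsets(s)) for s in subsets(full)}
    # h[S] = product of P(not Z_A) over all A contained in S.
    h = {s: q[full] / q[full - s] for s in subsets(full)}
    z = {}
    for a in subsets(full):
        if not a:
            continue
        u = 1.0  # P(not Z_A), isolated by alternating ratios of h over subsets of A
        for b in subsets(a):
            if (len(a) - len(b)) % 2 == 0:
                u *= h[b]
            else:
                u /= h[b]
        z[a] = 1.0 - u
    return z
```

A round trip `implicit_to_explicit(explicit_to_implicit(z, n), n)` should return the original `z` up to rounding, which is a convenient self-check before feeding the virtual-event probabilities to an ordinary fault-tree code.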

III. STANDBY SAFETY SYSTEM ANALYSIS

Standby safety systems are usually dormant but tested periodically at intervals $T$ to discover and repair possible failures that entered between the tests. The system is supposed to respond when a true demand (initiating event) occurs at a random time. The probability of failure on demand is the time-dependent unavailability which, asymptotically, is a periodic function. Because a true demand (initiating event) can occur at any time, it is customary to use the time-average unavailability of a component as the probability of a basic event. For a constant failure rate and s-identical components, these probabilities are:

This section has assumed the usual situation, […], for any failure rates. The fact that complicates the system analysis (or PSA) is that using the time-average component unavailabilities as s-independent basic-event probabilities leads to incorrect time-average system unavailabilities. This section shows how to avoid this by selecting correct joint probabilities for the implicit model and s-independent event probabilities for the explicit model.

Standby components have many other failure modes (human errors, failures per demand, repair unavailability [5], [6]) that are not considered here. For illustration, the problem is simplified by including only the time-dependent part of the unavailability between test episodes. Real common-cause failures are also excluded in this section.

The system (or PSA) minimal cut-sets contain products of component-level basic events, and they should be correctly quantified in the implicit method as:

(5)

In (5), the components are assumed to be s-identical and to have a common hazard rate. All component unavailabilities are s-identical when the components are tested simultaneously (or consecutively with negligible delays), but they differ from each other by a time-shift if tests are staggered.

A. Consecutive Testing

For consecutive (or simultaneous) testing, […] for […].

1) Implicit Method: Equation (5) yields the following probabilities:

(6)


2) Explicit Method: Equations (1)–(4) yield the following probabilities:

(7)

3) Summary: These results show that with consecutive testing for […] one must model explicitly only single and double failures: assuming that the higher-order event probabilities are zero is a minor approximation.

B. Staggered Testing

In this case:

1) Implicit Method: Equation (5) yields the following probabilities:

(8)

2) Explicit Method: Equations (1)–(4) yield the following probabilities:

(9)

3) Summary: As in Section III-A, only single and double failure events are essential. Negative values assigned to some probabilities seem strange; but one should remember that these are just numerical values that yield correct system-level results when used as “probabilities of the virtual s-independent basic events.”
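The sign change between consecutive and staggered testing can be reproduced numerically for a pair of components. The sketch below rests on assumptions that are not the paper's exact expressions: the unavailability between tests is linearised as failure rate times time since the last test, the second component is tested half an interval after the first, and the rate and interval values are hypothetical. The $n = 2$ relation for the virtual double-failure event follows from $X_1 = Z_1 \cup Z_{12}$, $X_2 = Z_2 \cup Z_{12}$.

```python
def mean_joint_unavailability(rate, T, test_offsets, steps=6000):
    """Time-average over one interval of the product of rate * (time since last test)."""
    h = T / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h  # midpoint rule
        product = 1.0
        for offset in test_offsets:
            product *= rate * ((t - offset) % T)
        total += product
    return total / steps


def z12(x1, x2, x12):
    """Probability of the virtual double-failure event for a pair (n = 2)."""
    return (x12 - x1 * x2) / (1.0 - x1 - x2 + x12)


rate, T = 1.0e-4, 720.0  # assumed values: failures per hour, hours
x1 = mean_joint_unavailability(rate, T, [0.0])
for name, offsets in [("consecutive", [0.0, 0.0]), ("staggered", [0.0, T / 2])]:
    x12 = mean_joint_unavailability(rate, T, offsets)
    print(name, x12 / (rate * T) ** 2, z12(x1, x1, x12))
```

Under these assumptions the normalised joint term is 1/3 for consecutive testing and 5/24 for staggered testing, against 1/4 for the product of the two single terms, so the virtual double-failure "probability" is positive in the first case and negative in the second.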

C. Random Failure Rates

In Sections III-A and III-B, the failure rate is assumed to be constant in time, and the same for $n$ components. The next generalization is to consider a common failure rate that varies randomly in time, from one test interval to the next (or varies slowly so that it is nearly constant during a single interval). In a best-estimate analysis this means that in the joint probabilities of (6) and (8), any power of the failure rate should be replaced with its expected value. This can appreciably change the explicit-model probabilities of (7) and (9). For example, for $n = 2$ and consecutive testing, the result depends on the standard deviation of the failure rate. If the standard deviation is of the same order of magnitude as the mean value, or larger, then ignoring the variation leads to very important under-estimation of the system unavailability.

The same formalism is valid if the failure rate is constant in time but different in various realizations, i.e., the rate is a random sample from a distribution that then determines the expected values. In this case the uncertainty can be reduced as time goes by when more and more component and system-specific failure data become available. The equations in this section are valid at any time, given the uncertainty distribution and moments available at that time.
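The size of the effect can be illustrated for two consecutively tested components. The sketch below is an assumption-laden illustration rather than the paper's formula: the unavailability is linearised as rate times time since the last test, the expected square of the rate is written as mean squared plus variance, and the virtual double-failure probability is approximated by the joint probability minus the product of the single probabilities (a low-dependency approximation). The function name is hypothetical.

```python
def double_failure_probability(mean_rate, std_rate, T):
    """Low-dependency estimate x12 - x1*x2 for two consecutively tested components."""
    second_moment = mean_rate ** 2 + std_rate ** 2  # E(rate^2) = mean^2 + variance
    x1 = mean_rate * T / 2.0                        # time-average of E(rate) * t
    x12 = second_moment * T ** 2 / 3.0              # time-average of E(rate^2) * t^2
    return x12 - x1 * x1


fixed = double_failure_probability(1.0e-4, 0.0, 720.0)
random_rate = double_failure_probability(1.0e-4, 1.0e-4, 720.0)
print(fixed, random_rate, random_rate / fixed)
```

Under these assumptions, a standard deviation equal to the mean makes the double-failure term five times larger than with a fixed rate, which is the kind of under-estimation the text warns about.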

1) Consecutive Testing: Replace any power of the failure rate with its expected value in (6). Then (1)–(4) yield:

2) Staggered Testing: Replace each power of the failure rate with its expected value, E(·), in (8); then (1)–(4) yield:

D. Approximate Transformations

Under the conditions:

it is possible to simplify the relationships of (1)–(4):


The solutions are:

etc

The equations in this section can be used to verify the results in Sections III-A–III-C because the assumptions of LD are valid there. However, these simpler relationships are not generally valid, e.g., for repeated human errors or for real common-cause events that are described in Sections IV and V. In these latter cases the higher order terms can be nearly equal to or even higher than lower order terms, e.g., […].

E. Component-Specific Failure Rates

The results in Sections III-A and III-B are based on the assumption that all components have the same failure rate. Experience has shown that sometimes even nominally identical components can have appreciably different failure-rates. This can be easily accounted for by replacing any power of the common failure rate in any joint probability with the product of the corresponding component-specific failure rates, without changing the numerical coefficient in front. Section III-D shows that the same rule applies to each explicit-model probability.

IV. DEPENDENT REPEATABLE HUMAN ERRORS

Notation

$X_i$  failed state of component $i$ due to a human-error in task $i$.

The tasks are consecutive and ordered-in-time according to $i = 1, 2, \ldots, n$. Typical tasks are calibrations or tests of redundant trains of a safety system. Possible errors are miscalibrations or failures to realign valves or switches after a test. Repeated errors are likely in sequential maintenance tasks if human redundancy or time-separation are not applied between redundant trains. Two kinds of probability are defined: the probability of an error in task #1, and the conditional probability of repeating the error for the $k$th time, given that it has just occurred exactly $k - 1$ consecutive times in the current maintenance cycle. Thus,

...

It is also necessary to define conditional probabilities after a successful task, and after several consecutive successes.

Assumption

The conditional probability of an error just after a success is s-independent of what happened before that success. One limiting value of this probability indicates that there is no success-dependency (success does not promote success); the other indicates maximum s-dependency.

With these assumptions/definitions and standard probabilistic calculations, the following general equations for joint error probabilities [7] are derived:

(10)

Numerical values calculated from (10) can be inserted into the general transformation equations (1)–(4) to obtain values for the basic events of an explicit model.

It can be shown [7] that these rather general equations yield the Handbook formalism [8] if one considers special cases in which the error dependency parameter of the Handbook formalism takes a separate value for each of the s-dependency levels: ZD, LD, MD, HD, CD, respectively.
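For orientation, the dependence levels can be turned into numbers with the conditional-probability equations commonly quoted from the THERP handbook. The sketch below is an assumption, not a reproduction of (10): it takes the conditional error probability as $(1 + Kq)/(K + 1)$ with $K = \infty, 19, 6, 1, 0$ for ZD, LD, MD, HD, CD, assumes that every repetition uses the same conditional probability, and does not claim to match the entries of Tables I and II. The function names are hypothetical.

```python
import math

# Assumed dependency parameter K per level (THERP-style equations).
K_BY_LEVEL = {"ZD": math.inf, "LD": 19, "MD": 6, "HD": 1, "CD": 0}


def conditional_error_probability(q, level):
    """P(error in a task | error in the preceding task) for basic error probability q."""
    k = K_BY_LEVEL[level]
    if math.isinf(k):
        return q  # zero dependency: the preceding error does not matter
    return (1.0 + k * q) / (k + 1.0)


def joint_all_errors(q, level, n):
    """P(errors in all n consecutive tasks), same conditional probability each time."""
    return q * conditional_error_probability(q, level) ** (n - 1)


for level in ("ZD", "LD", "MD", "HD", "CD"):
    print(level, conditional_error_probability(0.01, level), joint_all_errors(0.01, level, 2))
```

Joint probabilities produced this way could then be passed to the transformation of Section II to obtain explicit basic-event probabilities.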

Reference [8] also gives rules for selecting the value of $q$ and the s-dependency level; they are not repeated here. Table I lists numerical values of joint probabilities for various s-dependency levels for $q = 0.01$. Table II gives the corresponding explicit basic-event probabilities, obtained by using (10), (1), (2).

The symmetry is not valid for human errors. Nominally identical components are not s-identical with respect to human errors, and probabilities depend on the system size. The following rules should be observed:

1) For ZD, it is sufficient to model only s-independent single error events for all components.

2) For CD, it is sufficient to model only 1 total error event.

3) Numerical tables [(10), and (1)–(4)] are needed for the intermediate s-dependencies: LD, MD, HD. For $q < 0.01$ the values in Tables I and II can be scaled down proportional to $q$.

TABLE I. JOINT HUMAN ERROR PROBABILITIES, FOR q = 0.01

TABLE II. EXPLICIT HUMAN BASIC-EVENT PROBABILITIES FOR q = 0.01

V. STANDBY SYSTEMS WITH COMMON CAUSE FAILURES

Common cause events, $Z_{ij\ldots}$, are considered (in this paper) to be events that simultaneously fail exactly specific components: $i, j, \ldots$, and no others. When CCF analysis is applied to standby (inactive) systems, the causes, $Z_{ij\ldots}$, are considered mostly external and mutually s-independent; thus any 1 CCF having occurred does not preclude others from occurring. This might be a slightly conservative assumption, at most, but is consistent with the assumption that the events $Z$ are mutually s-independent.

It is quite customary to consider CCF explicitly assuming that their probabilities are symmetric and “time-independent per demand,”

etc;

etc.

Always remember that the probabilities depend on $n$, although this is not indicated by the sub-indices of the $Z$. There are many parametric CCF-models in the literature, most of which yield values or formulas for the explicit model-probabilities $P(Z_{ij\ldots})$.

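Equations for the joint probabilities $P(X_i \cap X_j \cap \cdots)$ needed in implicit models can be solved in terms of the $P(Z_{ij\ldots})$ from the equations in Section II. Under the commonly valid assumption that all $P(Z_{ij\ldots}) \ll 1$ (rare event approximation), without conditions for the relative magnitudes of the $P(Z_{ij\ldots})$, then:

(11)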
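The quality of a rare-event approximation can be checked against the exact product form for a symmetric group of three. The sketch below is illustrative only: `z1`, `z2`, `z3` denote the assumed probabilities of one specific single, double and triple cause-event, and the approximate expressions are sums of minimal-term probabilities derived here for the sketch, not a transcription of (11).

```python
def exact_joint_n3(z1, z2, z3):
    """Exact P(X1), P(X1 and X2), P(X1 and X2 and X3) for a symmetric group of 3."""
    u1, u2, u3 = 1.0 - z1, 1.0 - z2, 1.0 - z3
    q1 = u1 * u2 ** 2 * u3         # P(component 1 not failed): Z1, Z12, Z13, Z123 absent
    q12 = u1 ** 2 * u2 ** 3 * u3   # P(components 1 and 2 not failed)
    q123 = u1 ** 3 * u2 ** 3 * u3  # P(no component failed): all 7 events absent
    x1 = 1.0 - q1
    x12 = 1.0 - 2.0 * q1 + q12
    x123 = 1.0 - 3.0 * q1 + 3.0 * q12 - q123
    return x1, x12, x123


def rare_event_joint_n3(z1, z2, z3):
    """Sum of minimal-term probabilities, valid when all z are much smaller than 1."""
    x1 = z1 + 2.0 * z2 + z3
    x12 = z2 + z3 + z1 ** 2 + 2.0 * z1 * z2 + z2 ** 2
    x123 = z3 + 3.0 * z1 * z2 + 3.0 * z2 ** 2 + z1 ** 3
    return x1, x12, x123


print(exact_joint_n3(1.0e-2, 1.0e-3, 5.0e-4))
print(rare_event_joint_n3(1.0e-2, 1.0e-3, 5.0e-4))
```

No ordering of the three magnitudes is assumed, so the approximation keeps product terms such as `z1 ** 2` next to the linear common-cause terms.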

Three problems with this formalism in (11) are:

1) It assumes completely identical units and symmetry, even if in practice nominally s-identical units can have quite different rates or probabilities.

2) Probabilities per demand do not indicate any dependence on the length of a test interval nor how the probabilities depend on staggering of the tests. With staggering, the probabilities of different component pairs can differ, as stated in Section V-A.

3) Using constant probabilities makes it impossible to optimize test intervals: Extending the interval to infinity (terminating the weekly or monthly testing practice that is common at nuclear power plants) would not change anything in this formalism.

A. General Multiple Failure-Rate Models

An alternative to per-demand probabilities is to consider general multiple-failure rates, so that each rate multiplied by a short time increment is the probability of failure of exactly the corresponding components in that increment due to a common cause. These failures would remain latent in a standby system until discovered by a scheduled test. For simultaneous (consecutive) testing, the average residence-time of the failures is approximately 1/2 of the test interval, which gives a general first-cut approximation:

(12)

With staggered testing the average residence-time is generally shorter, especially if there is an ETR as follows: whenever a component is found failed, the other trains are also tested to discover and repair any common-cause failures. The same effect results if the characteristics of a failure always indicate whether it is a CCF, and all simultaneous failures are repaired whenever 1 failure is discovered by a test.

Exact derivation of all time-average joint and explicit-model probabilities along the lines of Sections III-A and III-B, now with all multiple-failure rates, would be a tremendous task. Approximate expressions can be obtained by considering the average residence-times of each failure combination under various testing schemes. With consecutive-testing and staggered-testing with ETR, the residence-time ends with the very first test that finds a failed component. Under the condition […], the results are:

1) Consecutive or Simultaneous Testing:

for any

2) Staggered Testing With Identified CCF (or With ETR):

(13)

Thus, staggered testing is better for all $k$-out-of-$n$ systems. The complete ($n$-fold) CCF terms are 3 and 4 times “weaker” with staggered testing than with consecutive testing, for $n = 3$ and 4, respectively.

3) Staggered Testing Without ETR: For staggered testing without CCF identification (and without ETR), one has to consider different system-success criteria for each $n$. The effective residence-time of a CCF failure-combination can depend on that. In a 1/3-system (1-out-of-3:G) a system failure due to $Z_{123}$ is removed by the very first test (average residence time: $T/6$), while in a 2/3-system it is removed only after 2 tests (average residence time: $T/6 + T/3 = T/2$). After detailed derivations (not given in this paper), the following conclusions can be drawn:

1) For all 1/$n$-systems, all explicit-model probabilities without ETR are the same as with ETR.

2) For other systems the probabilities are the same as with ETR (13), except for the following items for […]:

2/3-system:

2/4-system:

3/4-system:

(14)
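The residence-time argument for a complete ($n$-fold) common-cause failure can be sketched as follows. This is a simplified illustration under assumptions stated here, not the paper's equations (12) to (14): tests are evenly staggered (one test every $T/n$), test and repair durations are negligible, no new failures occur during the residence time, each test without ETR restores only the tested component, and a $k$-out-of-$n$:G system is restored once $k$ components have been repaired. The function name and scheme labels are hypothetical.

```python
def mean_residence_time(T, n, k, scheme):
    """Mean time a complete n-fold CCF keeps a k-out-of-n:G standby system failed."""
    if scheme == "consecutive":
        return T / 2.0                    # all components tested together once per T
    first_test = T / (2.0 * n)            # mean wait until the next staggered test
    if scheme == "staggered_etr":
        return first_test                 # the first test triggers repair of all trains
    if scheme == "staggered_no_etr":
        return first_test + (k - 1) * T / n  # k successive tests are needed
    raise ValueError("unknown testing scheme: " + scheme)


T = 720.0  # assumed test interval, hours
for k in (1, 2):
    print(k, mean_residence_time(T, 3, k, "staggered_no_etr"))
```

Multiplying such a residence time by the corresponding multiple-failure rate gives a first-cut probability for the complete CCF event under the chosen testing scheme.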

The rates generally depend on $n$, so that the rate of a given failure combination, for example, is not the same for different group sizes. In a symmetric case, it is advisable to use the notation:

etc.

With […], the probabilities of different component pairs differ even in a symmetric case because of different mutual staggering of components 2 and 3 with respect to component 1, and of components 3 and 4 with respect to component 2.

4) Extension, Estimation, and Approximation: The results in this section give all probabilities needed in explicit models. For implicit models with symmetry, obtain approximate results by using (11):

etc.

with 1 exception: for […] there are 2 different values because […], etc., and one might use the weighted average:

The accuracy of the probabilities of (12)–(14) can be judged by comparison with the analytic time-average unavailabilities obtained [9] under the symmetry assumption,

etc.

for 3 different testing schemes:

1) simultaneous (consecutive),
2) staggered without ETR,
3) staggered with ETR.

Such comparisons and other approximations [10] indicate that the current results yield accurate system-results whenever common-cause failures dominate the system unavailability. Especially the linear terms (e.g., […]) are accurate, and the other terms are within 20%.

When the single-failure terms are important, then the accuracy can be improved by adding to the explicit-model probabilities the corresponding “synchronization” terms from (7) or (9).

The advantage of the GMFR model described here is that it allows studies about various testing schemes and about optimization of the test interval [11]. The rate-parameters can be easily estimated based on the number of events observed over the system observation time:

It is statistically possible to use empirical Bayes to combine data from several sources [12].
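A point estimate of a multiple-failure rate and the resulting first-cut probability can be sketched as below. The simple ratio of observed events to observation time is used here as an assumption; the paper's own estimator is not reproduced, and the function names and numbers are hypothetical. The first-cut probability uses the half-interval residence time stated for consecutive testing.

```python
def estimate_rate(n_events, observation_time):
    """Simple point estimate of a failure rate: events per unit observation time."""
    return n_events / observation_time


def first_cut_probability(rate, test_interval):
    """Rate times the average residence time T/2 (consecutive testing)."""
    return 0.5 * rate * test_interval


rate = estimate_rate(2, 400000.0)          # assumed: 2 events in 400 000 system-hours
print(rate, first_cut_probability(rate, 720.0))
```

With sparse data, a Bayesian estimate combining several sources, as the text suggests, would be the more robust choice than this simple ratio.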

The model can often be simplified by including only individual single failure events, and one “macro” basic event that fails the system. For a 1/3-system, […]; for a 2/3 system, […]; etc. This simple SFRM has fewer parameters to be estimated from field data [13].

The concept of residence-times can be extended to common-cause failures that are design-weaknesses or “remain undetected for extended periods.” These can be modeled by constant values, as illustrated with an auxiliary feedwater system pilot study [5] in 1980 (later called “basic parameter” modeling).

B. Economic Optimization

A simple example of optimization is a standby $k$-out-of-$n$ system designed to respond to initiating events (demands) that occur with a given frequency. Each train has a testing cost and a repair cost. The system fails, on average, to respond properly with a probability equal to its average unavailability, and in such a case an accident occurs with a given cost. The total average cost rate is

(15)

Because is a regular increasing function of, an op-timal exists that minimizes . To get an idea about the role

Page 9: Treatment of general dependencies in system fault-tree and risk analysis

286 IEEE TRANSACTIONS ON RELIABILITY, VOL. 51, NO. 3, SEPTEMBER 2002

TABLE IIICOEFFICIENTSa �= U (T )=T

of the parameters in this optimization, consider a typical casesuch that the linear common-cause terms dominate the systemunavailability,

(16)

where the constant factor depends on system configuration and testing scheme. The can be calculated using (12)–(14). Table III lists the results for the symmetric case. With (15) and (16), the optimum test-interval is

(17)

Thus, the optimal increases with and , and decreases with increasing ; it does not depend on the repair cost. If there is an administrative upper-limit for , then that limit might dictate a shorter than in (17). In practice, the problem is often more involved with multiple systems, initiators, and intervals [11].

If the unavailability is modeled with parameters independent of , then an unrealistic optimal test-interval results from (15).
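The trade-off in this subsection can be sketched numerically. The sketch below is not a transcription of (15)–(17); it assumes a cost rate with three parts (testing cost inversely proportional to the test interval, a repair-cost rate that does not depend on the interval, and an accident-cost rate proportional to an unavailability that is linear in the interval). All function names, parameter names, and numbers are hypothetical.

```python
import math


def cost_rate(T, n, c_test, repair_cost_rate, f, c_acc, a, lam):
    """Assumed total average cost rate for test interval T.

    n * c_test / T    : testing cost rate for n trains (assumed form)
    repair_cost_rate  : assumed independent of T
    f * c_acc * U(T)  : accident cost rate, with U(T) = a * lam * T assumed
    """
    unavailability = a * lam * T
    return n * c_test / T + repair_cost_rate + f * c_acc * unavailability


def optimal_interval(n, c_test, f, c_acc, a, lam):
    """Interval that minimizes cost_rate under the assumptions above.

    Setting d(cost_rate)/dT = 0 gives T* = sqrt(n*c_test / (f*c_acc*a*lam)).
    The repair term drops out because it does not depend on T.
    """
    return math.sqrt(n * c_test / (f * c_acc * a * lam))


# Hypothetical inputs (time unit: hours, cost unit arbitrary).
params = dict(n=3, c_test=1000.0, f=1e-5, c_acc=1e9, a=0.5, lam=1e-6)
T_opt = optimal_interval(**params)
cost_at_optimum = cost_rate(T_opt, repair_cost_rate=0.02, **params)
```

Under these assumptions the optimum grows with the number of trains and the testing cost, shrinks with the demand frequency and the accident cost, and is unaffected by the repair cost, which matches the qualitative conclusions stated in the text. If the unavailability term did not depend on the interval, the cost rate would decrease monotonically and no finite optimum would exist.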

VI. SPECIAL CONFIGURATIONS: EVENTS WITH KNOWN FAILED STATES

In a special configuration-assessment, it might be known that a certain component is failed or under maintenance, and the interest is in knowing the system-reliability or system-risk under such a condition. If the basic event is not a member of any group of mutually s-dependent events, it is easy enough to set the corresponding event to TRUE in a fault-tree, or set its probability to 1 in the system probability equation. This often applies to maintenance outages. Many computer-codes even calculate importance factors, e.g., Risk Achievement Worth, that directly answer the question. The same methods apply if one knows a specific explicit common-cause event to have occurred.

The situation is more problematic if the component is a member of a group of mutually s-dependent components, but one does not know whether it is down due to a single failure or down due to a specific common-cause event. One has to know how to change the joint probabilities as a consequence of the limited knowledge. The rules for the changes are different in the implicit and explicit methods, and the rules depend on what is known about the causes influencing other components of the group. Some guidance has been offered for explicit models [14]. Section VI-A extends guidance to implicit models and to some other cases.

A. Failure is Known to Be Component-Specific

If it is known that a component is failed due to causes affecting only that component, and nothing is known about the other components of the group (e.g., in case of a standby system when a test reveals one component to be failed), the situation is easily managed in an explicit model by setting the corresponding component-specific event to TRUE, as described at the beginning of Section VI, and leaving the others unchanged.

For an implicit model, one has to replace each “joint probability where is a factor” by the conditional probability, given

TRUE. Thus, each

is replaced by

[Because and is s-independent of all , when .] In other words, one can simply delete from each term,

which gives the numerical replacements for the probabilities:

etc.; i.e., delete the sub-index to find out which numerical value to use in the implicit model terms.
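The sub-index deletion rule can be sketched as a table lookup. The sketch assumes that the implicit-model joint probabilities are stored in a dictionary keyed by the set of component indices in each term; the type alias, the function name, and the numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, FrozenSet

# Assumed storage: {set of component indices: joint failure probability}.
Joint = Dict[FrozenSet[int], float]


def condition_on_specific_failure(joint: Joint, i: int) -> Joint:
    """Apply the Section VI-A rule for a component-specific failure of i.

    Every joint probability whose index set contains i is replaced by the
    joint probability of the remaining indices (sub-index i deleted).
    The single-component term for i itself becomes 1.
    """
    updated = dict(joint)
    for group in joint:
        if i in group:
            rest = group - {i}
            updated[group] = joint[rest] if rest else 1.0
    return updated


def fs(*indices: int) -> FrozenSet[int]:
    """Shorthand for building index sets."""
    return frozenset(indices)


# Hypothetical joint probabilities for a group of three components.
joint = {
    fs(1): 0.01, fs(2): 0.01, fs(3): 0.01,
    fs(1, 2): 0.002, fs(1, 3): 0.002, fs(2, 3): 0.002,
    fs(1, 2, 3): 0.001,
}

# Component 1 is known to be failed from a cause specific to it.
conditioned = condition_on_specific_failure(joint, 1)
```

After the update, the term for components 1 and 2 carries the single-component value of component 2, and the triple term carries the value of the pair (2, 3); terms that do not involve component 1 are unchanged.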

B. Failure is Not Known to Be Component-Specific

The situation is different if one knows only that a component is down but does not know whether it is due to a single-failure cause or due to some common-cause event that could have failed the component. In this case, one actually knows only that the failure event of the component is TRUE. In the implicit model this means that a joint probability

[has as any sub-index] must be replaced with the conditionalprobability

Thus, the rule:


each joint-probability in which is a member, must be divided by , the total probability of component failure (due to the modes of failure included in the s-dependency).

In the explicit method, each

must be replaced with the conditional probability

[Because ; or because.]

Thus, the rule:

etc.

Such rules for certain explicit CCF-models were used in [14] and [15]. The value is not automatically available in the explicit method. One can often use , or more exactly

the product or sum is over all with any sub-index equal to .
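The division rule and the two ways of obtaining the divisor can be sketched as follows. The sketch assumes dictionary storage keyed by index sets, and it assumes that the "more exact" divisor is one minus the product of the complements of the explicit event probabilities, which holds when the explicit events are mutually s-independent. Function names and numbers are hypothetical.

```python
import math
from typing import Dict, FrozenSet

Table = Dict[FrozenSet[int], float]


def condition_on_unknown_cause(joint: Table, i: int) -> Table:
    """Section VI-B rule, implicit model: component i is known to be down,
    cause unknown. Each joint probability containing i is divided by the
    total failure probability of component i."""
    p_i = joint[frozenset({i})]
    return {g: (p / p_i if i in g else p) for g, p in joint.items()}


def total_failure_probability(explicit: Table, i: int, exact: bool = True) -> float:
    """Total failure probability of component i from explicit event
    probabilities, over all explicit events with i among their sub-indices.

    exact=False: sum of the probabilities (rare-event approximation).
    exact=True : 1 - product(1 - z), assuming s-independent explicit events.
    """
    relevant = [z for group, z in explicit.items() if i in group]
    if exact:
        return 1.0 - math.prod(1.0 - z for z in relevant)
    return sum(relevant)


# Hypothetical explicit event probabilities for a two-component group.
explicit = {frozenset({1}): 0.01, frozenset({2}): 0.01, frozenset({1, 2}): 0.001}
p1_sum = total_failure_probability(explicit, 1, exact=False)
p1_exact = total_failure_probability(explicit, 1, exact=True)

# Hypothetical implicit joint probabilities for the same group.
implicit = {frozenset({1}): 0.011, frozenset({2}): 0.011, frozenset({1, 2}): 0.0011}
given_1_down = condition_on_unknown_cause(implicit, 1)
```

The sum is always at least as large as the exact value, so dividing by it gives a slightly smaller conditional probability; the difference is negligible when all probabilities are small.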

Numerical Examples: 1) A group of 2 standby diesel-generators with specific explicit failure probabilities (unavailabilities)

The implicit model-probabilities are then

For a redundant 2 × 100% (1-out-of-2: ) system, the system unavailability is .

If one knows that diesel generator #1 is down due to “individual causes not related to common causes,” then the system unavailability is , by the rule of Section VI-A. However, if it is not known whether generator #1 could be down due to a common-cause, then the system unavailability is

This is more than 13 times as large as in the previous case. Thelack of knowledge has a major impact in on-line analysis.
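The effect of the two states of knowledge can be checked by brute-force enumeration of an explicit model. The sketch below uses hypothetical probabilities, not the values of the example above, so the resulting ratio differs from the factor quoted in the text. It assumes three mutually s-independent explicit events (two component-specific failures and one common-cause failure of both generators); all names are illustrative.

```python
from itertools import product

# Hypothetical explicit event probabilities (not the paper's values).
z = {"Z1": 0.01, "Z2": 0.01, "Z12": 0.001}


def prob(event, given=lambda state: True):
    """P(event | given), by enumerating all states of the explicit events."""
    numerator = denominator = 0.0
    for bits in product([False, True], repeat=len(z)):
        state = dict(zip(z, bits))
        weight = 1.0
        for name, occurred in state.items():
            weight *= z[name] if occurred else 1.0 - z[name]
        if given(state):
            denominator += weight
            if event(state):
                numerator += weight
    return numerator / denominator


def x1(state):
    """Generator 1 is down (own cause or common cause)."""
    return state["Z1"] or state["Z12"]


def x2(state):
    """Generator 2 is down (own cause or common cause)."""
    return state["Z2"] or state["Z12"]


def both_down(state):
    """1-out-of-2 system fails when both generators are down."""
    return x1(state) and x2(state)


u_nominal = prob(both_down)                                   # no information
u_specific = prob(both_down, given=lambda s: s["Z1"])         # Section VI-A case
u_unknown = prob(both_down, given=x1)                         # Section VI-B case
ratio = u_unknown / u_specific
```

The enumeration reproduces both rules: the component-specific case equals the total failure probability of generator 2, and the unknown-cause case equals the joint probability of both generators divided by the total failure probability of generator 1. With these hypothetical inputs the unknown-cause unavailability is roughly nine times the component-specific one.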

2) Consider repeatable human errors in a 1-out-of-3: system with HD and the values of Table I. The unavailability due to errors is usually . If an error is found in any single train , the system unavailability increases to . If trains 1 and 2 (or 2 and 3) are found failed, the system unavailability becomes . If trains 1 and 3 are failed, the middle train 2 is down with probability .

C. Other Cases

Sometimes it is possible to know that 1 component (component ) is failed but the other components of the group are unfailed. This is possible by testing all redundant components once 1 component is found failed. Then one actually knows that TRUE and all other events FALSE within the group. In the explicit model, these can be accounted for in the fault-tree or by setting and all other .

In the implicit model, this situation means TRUE and FALSE for all . This can be accounted for by setting , and deleting all terms that have any .

However, in standby systems, a component is known to be good only momentarily, just at the time of a test. From that point on, latent (hidden) failures are possible and therefore the situation is more like that of Section VI-A.

If one knows the status of more than 1 component (event), one has to derive the conditional joint probabilities under multiple conditions in the same way as in Sections VI-A and VI-B for a single condition.

REFERENCES

[1] N. H. Roberts, W. E. Vesely, D. F. Haasl, and F. F. Goldberg, Fault Tree Handbook: US NRC, 1981, NUREG-0492.

[2] Anon, PRA Procedures Guide: US NRC, 1983, NUREG/CR-2300.

[3] K. N. Fleming and A. Mosleh, “Common-cause data analysis and implications in system modeling,” in Proc. Int. Topical Meeting on Probabilistic Safety Methods and Applications, vol. 1, EPRI NP-3912-SR, 1985, pp. 3/1–2/12.

[4] J. K. Vaurio, “An implicit method for incorporating common-cause failures in system analysis,” IEEE Trans. Reliability, vol. 47, no. 2, pp. 173–180, Jun. 1998.

[5] J. K. Vaurio, “Availability of redundant safety systems with common-mode and undetected failures,” Nuclear Engineering and Design, vol. 58, pp. 415–424, 1980.

[6] J. K. Vaurio and D. Sciaudone, “Unavailability analysis of redundant safety systems,” in Proc. 1980 Reliability Conf. Electric Power Industry, 1980.

[7] J. K. Vaurio, “Modeling and quantification of dependent repeatable human errors in system analysis and risk assessment,” Reliability Engineering and System Safety, vol. 71, pp. 179–188, 2001.

[8] A. D. Swain and H. E. Guttman, Hdbk. of Human Reliability Analysis with Emphasis on Nuclear Power Plant Applications: US NRC, 1983, NUREG/CR-1278.

[9] J. K. Vaurio, “The effects of testing arrangements on the unavailability of standby systems,” in Proc. PSA’1993, vol. 1, pp. 654–660.

[10] J. K. Vaurio, “The theory and quantification of common cause shock events for redundant standby systems,” Reliability Engineering and System Safety, vol. 43, pp. 289–305, 1994. (Note: Eq^U in table 1 has typographical errors; it is correct in [9].)

[11] J. K. Vaurio, “Optimization of test and maintenance intervals based on risk and cost,” Reliability Engineering and System Safety, vol. 49, pp. 23–36, 1995.

[12] K. E. Jänkälä and J. K. Vaurio, “Residual common cause failure analysis in a probabilistic safety assessment,” in Proc. PSA’1993, vol. 2, pp. 804–810.

[13] J. K. Vaurio, “Procedure for common cause failure assessment,” in Proc. PSA’1991: Int. Atomic Energy Agency, 1992, (IAEA-SM-321/46), pp. 505–515.

[14] A. Mosleh, D. M. Rasmuson, and F. M. Marshall, Guidelines on Modeling Common-Cause Failures in Probabilistic Risk Assessment: US NRC, 1998, appendix E, NUREG/CR-5485, INEEL/EXT-97-01327.

[15] Z. Simic, V. Mikulicic, and Z. Hebel, “On-line operation support modeling: Common cause dependency,” in Proc. PSA’1999, vol. 1, pp. 128–132.

[16] J. B. Dugan, “Experimental analysis of models for correlation in multiversion software,” in Proc. 5th Int’l Symp. on Software Reliability Eng’g (ISSRE), 1994, pp. 36–44.

[17] S. V. Amari, J. B. Dugan, and R. B. Misra, “A separable method for incorporating imperfect fault-coverage into combinatorial models,” IEEE Trans. Reliability, vol. 48, no. 3, pp. 267–274, Sep. 1999.

Jussi K. Vaurio (See IEEE Trans. Reliability, vol. 48, no. 3, p. 214, Sept. 1999.) He also teaches at Lappeenranta University of Technology, Finland.