
A Constructive Critique of Reliability-Centered Maintenance

David J. Sherwin, Lund University Institute of Technology, Lund.

Key Words: Maintenance, RCM, Life-cycle costs, Terotechnology.

SUMMARY & CONCLUSIONS

Maintenance should be based on the intrinsic RAM properties of the machinery to be maintained, and cost-optimized. Because maintenance acts on parts, data collection and analysis must also be at that level. Data costs are falling and optimization methods are improving, yet the maintenance industry still resists change. This is at least partly because the books on RCM contain some wrong ideas which spoil it as a basis for terotechnological investigation and amelioration. The paper first demolishes some of the tenets of RCM, then shows how these myths have delayed progress, and finally makes suggestions for a system of maintenance based more truly upon reliability. The points are illustrated by examples.

1. INTRODUCTION

The Concise Oxford Dictionary defines a fad as "A pet notion or rule of action, a craze, a piece of fancied enlightenment". Quality, Reliability and Terotechnology have a long history of fads. In Quality there were Quality Costs, Quality Circles (1970's), Taguchi Methods (1980's), ISO 9000 (1990's). In Reliability there were the Bathtub Curve (1960's), FMECA (1970's), Bayesian Methods (1980's). In Terotechnology we have RCM. All have some good features, but none is a panacea. Exclusive reliance on them is dangerous. All can be misused or used out of context. The paper shows that RCM is a fad by examining its weaknesses and errors, and then suggests some more effective methods.

2. GLOSSARY

[C]CM    [Continuous] Condition Monitoring
AR       Age Renewal at cost-optimized intervals
CF, CM   Total cost of a Failure, of a PM (respectively)
FMECA    Failure Modes Effects & Criticality Analysis
LCC/P    Life-cycle Costs & Profits
MSG3     Latest airline version of RCM
OEM      Original Equipment Manufacturer
PM       Preventive Maintenance - any action to prevent failure
RCM      Reliability-centered Maintenance as described in [1-3]
ROCOF    Rate of Occurrence of Failures as per [4]
β        Weibull distribution shape parameter

Terotechnology [11] is defined as "A combination of management, financial, engineering, building and other practices applied to physical assets in pursuit of economic life-cycle costs". This could now with advantage be amended to "... life-cycle profits".

3. BRIEF DESCRIPTION OF RCM

RCM is described fully by Nowlan & Heap, MIL-STDs, and Moubray, [1,2,3]. RCM is a good idea spoiled rather than a wholly bad scheme, though the good parts of it are not all original, whereas the faulty ones, apart from the Bathtub Curve confusion (see below), generally are specific to RCM.

RCM purports to be a procedure for discovering what maintenance is required by an asset in its operating context, in particular what must be done to ensure that it continues to provide its intended functions to its owner. But this should not be the sole aim; maintenance is an economic rather than just a reliability problem. In outline the RCM procedure is:
a) Define the system's functions
b) Define failure modes relative to these functions
c) Carry out FMECA
d) How can failure modes be prevented?
e) If prevention is not possible, what should be done?

Generally, this procedure is carried out by groups of engineers, technicians, and operators familiar with the plant to be maintained, with the expectation that they will advise less maintenance that requires stopping and dismantling, and more that is focused on function. They use what data there are, but if there are none, then they rely on experienced estimates; if a failure mode has never been recorded, they tend to assume that no maintenance is needed to prevent it, even though a PM routine may have been preventing it throughout the period of no failures. The criterion is reliability of function, not economics. The investigation groups contain both senior and junior personnel, contrary to the accepted theory of small-group dynamics, e.g. Quality Circles. The author has noted in two by-the-book RCM exercises that, however hard they try not to, senior people tend to bully juniors into agreeing to cuts in the PM which may save money for a while, but would eventually prove detrimental. Also, their understanding of the design may be inadequate to eliminate or open out schedules safely.

In a Decision Diagram, failures are classified as Function Loss, Safety/Environmental, Hidden Faults and Others (which do not directly affect functionality). Why this is considered so important is unclear, because the Decision Charts then advise almost the same procedure for each class, i.e. to examine, in order, the feasibility (rather than cost) of CM (running), PM (to restore, or renew), Inspection at Intervals (stopped, Hidden Faults only), and, as a last resort, re-design. The first feasible solution is to be accepted, except in MSG3, which also puts Inspection before PM in the order of consideration, and extends Inspection to modes other than Hidden Faults. That these alterations were made indicates the inadequacy of the original charts, which nevertheless remain in widespread use.

4. HISTORY & BATHTUB MISCONCEPTIONS

RCM is a child of the aircraft, and more particularly the airline, industries. Airliners have redundant machinery and control systems, and structures designed to tolerate minor damage without danger. Airlines found in the 1950's that increasing the intensity of overhauls to engines, other machines and avionics did not increase reliability. The U.S. Federal Aviation Authority's 1960 investigation was to analyze the factors affecting reliability and the efficacy of PM. It "confirmed" that "scheduled overhaul had little effect on overall reliability of complex items unless there was a dominant mode, and there are many items for which there is no effective form of scheduled maintenance...".

This of course is nonsense. It arises from misconceptions about the Bathtub Curve, see Ascher & Feingold, [4]. Figure 1 shows how repair of only the failed part leads to randomized part ages and a pseudo-Poisson process for the system as a whole. The raised ends are due to initial quality and training problems and wear-out of longer-lasting parts respectively.

Figure 1. Bathtub curve for a Repairable System (system ROCOF versus system lifetime T)

We can be sure that they never tried operating the aircraft entirely without such overhauls! If they had, they would have found that the reliability fell. What actually occurred was a switch from age to condition-based maintenance, which depended heavily upon better data collection and analysis, see for example Cole, [13]. They apparently did not investigate the quality of the workmanship, the introduction of new faults at PM, as a reliability factor either in 1960 or in the later investigations leading to the formulation of RCM, see Sherwin & Lees, [5]. Moubray's book, [3], for example, implies several times that the system or machine Bathtub Curve is inherent, when it is obviously shaped and scaled by the maintenance policy, and should not therefore be used to set the policy, [4,5]. More generally, the investigations were empirical and made no attempt to reconcile wear-out and AR theory at the part level with apparently Poisson failure patterns to machines. This was excusable in 1960, given the state of the art of Reliability Theory, but not since [4]. It is true now, though possibly not in 1960, that some electronic systems are best left alone, because either they are inherently reliable due to the low, smooth loading, or they have very low constant or falling hazard rates because failures are due to residual manufacturing quality faults and random voltage peaks. However, given the need to save weight, progressive effects in mechanical parts, such as metal fatigue, corrosion, wear and creep, are inevitable and usually amenable to CM, Inspection whilst stopped, or AR. But it is parts, not systems, which, given enough data, are amenable to PM optimization.

The reversing returns recorded under increasing overhaul frequency probably arose as follows. Some overhaul routines call for inspection and renewal of worn parts according to the judgment of the technician, others for renewal regardless of condition, still others for an exchange with a whole machine withdrawn from another identical system and overhauled at leisure, but none for the complete renewal of the machine which would justify treating it as statistically equivalent to a part. For example, a Weibull analysis of bus gearboxes by Kelly, [6], had shape parameter β = 2.5 for first failures from new, but β = 1.1 for subsequent failures. From new until first failure, all the part hazard rates are additive, but after overhaul, pseudo-Poisson failures would be expected because the wearing parts are then of different ages, different parts being renewed in each gearbox.
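The effect is easy to reproduce. Below is a minimal simulation sketch (not from the paper; the part lives, fleet size and horizon are invented for illustration, and numpy is assumed available): a machine of five wearing parts, each with a Weibull life of shape 2.5, is repaired by renewing only the failed part. A median-rank Weibull plot of first failures from new recovers a slope near 2.5, while the subsequent inter-failure times plot with a slope much closer to 1, just as in Kelly's gearbox data.

```python
import numpy as np

rng = np.random.default_rng(1)

def weibull_life(beta, eta):
    """Draw one Weibull(shape=beta, scale=eta) lifetime."""
    return eta * rng.weibull(beta)

def simulate_machine(betas, etas, horizon):
    """Renew only the failed part; return the system failure times."""
    next_fail = np.array([weibull_life(b, e) for b, e in zip(betas, etas)])
    failures = []
    while True:
        i = int(np.argmin(next_fail))                 # which part fails next
        t = next_fail[i]
        if t > horizon:
            return np.array(failures)
        failures.append(t)
        next_fail[i] = t + weibull_life(betas[i], etas[i])   # renew that part only

def weibull_plot_slope(times):
    """Median-rank regression estimate of the Weibull shape parameter."""
    t = np.sort(np.asarray(times))
    n = len(t)
    F = (np.arange(1, n + 1) - 0.3) / (n + 0.4)       # Benard's approximation
    return np.polyfit(np.log(t), np.log(-np.log(1.0 - F)), 1)[0]

betas = [2.5] * 5                                     # five wearing parts
etas = [1000.0, 1200.0, 1500.0, 1800.0, 2200.0]       # characteristic lives (hours)
first, later = [], []
for _ in range(200):                                  # a fleet of identical machines
    f = simulate_machine(betas, etas, horizon=20000.0)
    if len(f) > 1:
        first.append(f[0])
        later.extend(np.diff(f))

print(f"slope, first failures from new:        {weibull_plot_slope(first):.2f}")
print(f"slope, subsequent inter-failure times: {weibull_plot_slope(later):.2f}")
```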

In CM we assess whether the part will endure to the next renewal opportunity, so there is a tendency to renew at about the same intervals regardless of the frequency of checks, provided that this prevents most failures. In both this and the other cases, some bad quality parts are fitted, which fail early, and also some badly fitted renewals occur (poor workmanship). These are reducible by quality control and training, but are often not seen for what they are because of the confusion between part and system Bathtub Curves, and so appear to give negative reliability returns from more frequent overhauls. However, with good work and good spares, reliability would increase with overhaul frequency. There would, of course, be a turning point with respect to availability.

The other important "finding" of the 1960 investigation was that "There are many items for which there is no effective form of scheduled maintenance", [1,2,3]. This is a direct indication that the bathtub confusion dominates RCM theory, confirmed by the reference to the need for "a dominant mode". All frequently-failing parts either give detectable signs that they are about to fail, or else have rising hazard rate functions. The next question is whether the costs justify pre-emptive or on-condition renewal. Few such parts fail this test: the costs and their absolute difference can be quite small; it is the cost ratio which, together with the distribution form, determines whether such work is worthwhile.
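As a concrete illustration of the point about ratios, here is a sketch of the standard age-renewal cost-rate calculation (the model behind the glossary's AR; the Weibull parameters and costs are arbitrary assumptions, not data from the paper). The optimal interval, and whether age renewal pays at all, is governed by the ratio of failure cost to planned-renewal cost and by the shape parameter, not by the absolute costs.

```python
import numpy as np

def age_renewal_cost_rate(tau, beta, eta, c_pm, c_fail, n=2000):
    """Expected cost per unit time under age renewal at tau, Weibull(beta, eta) life."""
    u = np.linspace(0.0, tau, n)
    R = np.exp(-(u / eta) ** beta)                              # survival function
    cycle_length = np.sum((R[:-1] + R[1:]) / 2 * np.diff(u))    # E[min(T, tau)]
    cycle_cost = c_pm * R[-1] + c_fail * (1.0 - R[-1])          # planned vs failure renewal
    return cycle_cost / cycle_length

def best_interval(beta, eta, c_pm, c_fail):
    taus = np.linspace(0.05 * eta, 3.0 * eta, 600)
    costs = [age_renewal_cost_rate(t, beta, eta, c_pm, c_fail) for t in taus]
    i = int(np.argmin(costs))
    return taus[i], costs[i]

eta, c_pm = 1000.0, 1.0            # characteristic life; only the cost *ratio* matters
for beta in (1.0, 2.0, 3.0):
    for ratio in (2.0, 5.0, 20.0):
        tau, c = best_interval(beta, eta, c_pm, ratio * c_pm)
        print(f"beta={beta:.1f}  c_fail/c_pm={ratio:>4.0f}:  "
              f"best interval ~ {tau:6.0f}, cost rate ~ {c:.4f}")
# For beta = 1.0 the 'optimum' lands at the top of the grid: with a constant
# hazard rate there is no finite optimal age, i.e. run-to-failure is best,
# which is exactly the rising-hazard-rate condition stated in the text.
```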

The books on RCM all show six variations on the Bathtub Curve, see Figure 2. The axes of these graphs are marked, if at all, as "(Conditional) probability of failure versus Time". They do not try to distinguish between system and part time-scales, and policy recommendations are developed from the curves without regard to any economic factors, as if the prevention of "most of the potential non-random failures" were the only criterion of success. By random failures RCM means constant conditional probability, but whether that in turn implies part hazard rate or system ROCOF is never clear.


The “evidence” upon which these failure patterns are based is the same as that upon which it was concluded that overhauls did not improve reliability unless there was a dominant mode of failure. This suggests that they remain confused between systems and parts, ROCOF and hazard rate.

Discussing Pattern A, Moubray, [3] states that two or more modes are operating and that each must be dealt with separately, but he fails to acknowledge that the central portion may be the result of part renewals in a system.

In Pattern B, the initial flat portion is attributed to "random factors" which cause "faster wear than usual" in a part with a three-parameter Weibull distribution, once more indicating bathtub curve confusion.

Figure 2. RCM's Failure Patterns

The 1960 data, in which these patterns were all identified, are mainly for systems which were overhauled, with pseudo-Poisson failures between overhauls. The conclusion that renewal should occur just before the onset of relatively rapid wear-out is of course sub-optimal even for parts; if data suffice to identify this pattern then they suffice also for distribution analysis, which separates the Poisson and wear-out modes and permits optimization of the cost rate. RCM texts say that Pattern C may apply to parts failing by metal fatigue. Here again, it is unclear in [1,2,3] whether the time scale is part life or system-time-since-overhaul, with the fatigue failure as the dominant mode triggering overhaul. It is easy to show that a straight-line hazard rate implies a Weibull shape parameter β = 2. Fatigue failures are often Lognormal.

Pattern D is said to correspond to 1 < β < 2 in the Weibull form. The author's own data, gathered at chemical plants in Britain in the 1970's, [5], found centrifugal pumps with system ROCOF bathtubs of this shape, presumably because there was good manufacturing quality control in an established design. It also seems plausible that avionics systems of the 1950's would exhibit this pattern of ROCOF. They were "burned in", repaired by exchange, sent to a workshop to find and replace the failed part(s), then placed on the shelf for the next time.

The description of Pattern E, together with reference back to the data of the 1960 study, confirms that RCM specialists are definitely confused between hazard rate and ROCOF. They are also confused between true Poisson failures, which are by nature completely unpredictable individually, and failures which give detectable warning and are amenable to CM but not to AR because of their highly variable times to the start of detectable deterioration. Moubray [3] cites the example of rolling contact bearings. These are miniature systems which are renewed as parts; that is, they have several potential failure modes which compete and combine to cause failure. Weibull analysis of such bearings shows multiple modes which can be separated graphically. It is wrong to draw the Weibull plot and declare that the failures are random because the initial slope is about unity. Ball bearings do wear out, and elsewhere in his book Moubray describes roughly how that happens in respect of fatigue failures to the outer race. He states that the interval from detectable warning becoming available to actual failure (his P-F interval) is reasonably constant and that it "should not be necessary to take additional readings after the first sign of deviation is discovered ... should only be tracked if the process of deterioration is poorly understood". Actually, it can be shown in practical cases, such as the roller bearings in paper mills, [7], that the P-F interval is variable and that it pays to increase the frequency of readings when deterioration is first detected. In some cases continuous monitoring, either of the last phase or as the only policy, is economically best, [8]. Moubray is right to claim that better understanding of the failure process could lead to more accurate assessment of the time remaining, but so far, more precise analysis of the vibration frequencies relative to the rotational speed has led researchers to identify more (and more complex) failure modes. Money is wasted if more readings are not taken; the vibration level is quite likely to fall again, and the rate of deterioration varies between and within failure modes.
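The consequence of a variable P-F interval is easy to see numerically. The Monte Carlo sketch below uses purely illustrative assumptions (a lognormal P-F interval and warning onset uniform within the reading cycle; it is not the paper-mill model of [7]): the more variable the P-F interval, the larger the fraction of failures that slip past readings spaced at half its mean, which is the case for tightening the reading interval, or monitoring continuously, once deterioration is first seen.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
mean_pf = 30.0                        # mean P-F interval, days (hypothetical)
tau = mean_pf / 2.0                   # readings at half the mean P-F interval

def missed_fraction(pf, reading_interval):
    """Fraction of defects that reach failure before the next reading.

    Warning onset is assumed uniform within the reading cycle, so the wait to
    the next reading is U(0, reading_interval); the warning is missed whenever
    the P-F interval is shorter than that wait."""
    wait = rng.uniform(0.0, reading_interval, size=len(pf))
    return float(np.mean(pf < wait))

constant_pf = np.full(n, mean_pf)
for sigma in (0.25, 0.5, 1.0):                       # increasing P-F variability
    mu = np.log(mean_pf) - 0.5 * sigma**2            # keep the mean at mean_pf
    variable_pf = rng.lognormal(mu, sigma, size=n)
    print(f"sigma={sigma:.2f}: missed {missed_fraction(variable_pf, tau):6.1%}  "
          f"(a constant P-F interval would miss {missed_fraction(constant_pf, tau):.1%})")
```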

Finally, Pattern F is presented as the most common shape in the 1960 airline study data (68%). They call it "infant mortality", a phrase normally associated with parts, but then describe correctly the usual system effects, including poor maintenance workmanship, once again indicating basic confusion between the two types of bathtub curve. However, some RCM texts go on to perpetuate the myth of "too much preventive maintenance" as a cause of "decreasing failure rate". This is a pity, because one would have expected more logical development of the better workmanship theme, [5]. It is easy to show fast (but temporary!) savings by advising less PM rather than more training and better supervision. Bad work in a new system is inexperience rather than carelessness or over-maintenance. Talk of "unnecessary or unnecessarily invasive" routines is unhelpful; OEM's call for such early routines against the special problems of newness, such as the first change of the oil in a new engine. One car-hire company buys its cars new, ignores running-in and all the early maintenance routines, and sells them at 50,000 miles without even changing the oil or taking the first free service. They get little trouble, but the second owners do! This is not to say that the frequency and basic need for routines should not be challenged; unfortunately many OEM's make up on spare parts the money that they lost on competitive pricing. But the challenge should be made on the basis of data analysis and engineering investigation, including asking the OEM to justify his schedules. Hyper-exponentially distributed failures at the part level are due to bad workmanship or poor quality spares, [5]. At the machine or system level, the overall ROCOF settles to a constant value that is higher than it needs to be, and a reduction in PM frequency may sometimes lead to a temporary improvement, but permanent improvement is achieved by training maintainers in fitting techniques and spare parts inspection. We showed in three diverse situations (fertilizers, petro-chemicals and hospital autoclaves) that this was so, [5]. In each case, an increase in effective PM frequency saved money. Standards have improved since 1980 and we do not claim that this would always be so, now or then. But it is worrying that the RCM analysis is being applied to aircraft, nuclear plants, ships etc., to reduce the cost of maintenance.

5. THE DATA PROBLEM AS SEEN BY RCM

RCM denies the need for data for both rare and very common events, and is wrong in both cases. RCM fudges the data issue for rare but very serious events as follows: "The acquisition of the information thought to be most needed by maintenance policy designers - information about critical failures - is in principle unacceptable and is evidence of the failure of the maintenance program. This is because critical failures entail potential (in some cases, certain) loss of life, but there is no rate of loss of life which is acceptable to an organization as the price of failure information to be used for designing a maintenance policy. Thus the designer is faced with the problem of creating a maintenance system for which the expected loss of life will be less than one over the planned operational lifetime of the asset. This means that, both in practice and in principle, the policy must be designed without using experiential data which will arise from the failures which the policy is meant to avoid.", [9]. "Resnikoff's Conundrum", above, is treated as a profundity in RCM circles, but in fact (fatal) accidents are coincidences. The calculated probability of the accident is made very small by design; it is the product of much larger constituent event probabilities, which can be estimated from data collected from previous similar systems. If the problem is significant, there will be adequate data, and if it is not, then the censored data are reassuring. But if data are not collected then there can be no statistically sound assurance. Censorings are also data; operation without failure is relevant.

Example: those risking bungy-jumping are assured that an inert dummy is the first to jump each day, and that even it does not jump before the rope has been visually inspected. The attachments and anchor point are double-checked. Moreover, the rope is changed anyway after a fixed number of jumps, and is manufactured to a strict standard. Estimates of the probabilities of failure of equipment and procedure can be made, and multiplied together to form an estimate of the probability of an accident, which is more accurate than "known deaths / known jumps" in similar but non-identical situations. But even if there have been no fatalities, it is still possible to estimate the upper limit of probability from the number of successful jumps. Suppose there have been 1000 jumps and no fatalities. On a Poisson assumption, the best we can make in the circumstances, the 95% upper limit of the failure probability p is 0.003 per jump:

$\Pr(0) = e^{-1000\,p_{0.95}} = 0.05 \;\Rightarrow\; p_{0.95} \approx 0.003$   (1)
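The same zero-failure bound is a one-liner to compute for any number of trouble-free trials (a small sketch using the Poisson approximation of Eq. (1); the function name is ours):

```python
import math

def zero_failure_upper_limit(n_trials, confidence=0.95):
    """Upper confidence limit on the per-trial failure probability
    after n_trials with no failures (Poisson/exponential model)."""
    return -math.log(1.0 - confidence) / n_trials

print(zero_failure_upper_limit(1000))   # ~0.003, as in Eq. (1)
```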

Moubray, [3], also argues as follows against collecting data for common failures, reversing Resnikoff's argument: "This contradiction applies in reverse at the other end of the scale of consequences. Failures with minor consequences tend to be allowed to occur precisely because they do not matter very much. As a result, large quantities of historical data will be available concerning these failures, which means that there will be ample material for accurate actuarial analyses. These may even reveal some age limits. However, because the failures do not matter very much, it is highly unlikely that the resulting scheduled restoration or scheduled discard tasks will be cost-effective. So while the actuarial analysis of this information may be precise, it is also likely to be a waste of time. The chief use of actuarial analysis in maintenance is to study reliability problems on the middle ground where there is an uncertain relationship between age and failures which have significant economic consequences ... two categories ... (quality), large numbers of identical items ... and age-related failures) where preventive and failure costs are both very high." The basic error here is to suppose that maintenance is a question of reliability; it is really an economic problem, in which reliability is a factor. Figure 3 shows how PM, operator and maintainer training, quality control by the OEM, and the number of parts included in the PM schedules affect the shape of the machine or system bathtub curve and, as a consequence, the shape of the corresponding total-cost-from-new curve.

Figure 3. Malleability of the System Bathtub Curve under PM, and its Relation to Costs and Durability

The tangent at the origin of this cost curve represents the minimum cost rate if renewal (or overhaul) takes place at the tangential age. This theme is further developed in Figure 4, to include the value added as well as the expenditure, [10]. This is the Life-cycle Profit (LCP) principle: Net Benefit = Sales - Costs.

Figure 4. Optimization of Net Benefit (total profit against renewal age, showing the break-even age and the maximum-angle renewal point)
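Numerically, the Figure 4 construction amounts to choosing the renewal age that maximizes the profit rate, i.e. the steepest line from the origin to the net-benefit curve. A small sketch follows; the capital cost, revenue rate and cumulative maintenance-cost curve are invented purely for illustration.

```python
import numpy as np

capital = 50_000.0                # purchase / renewal cost
revenue_rate = 20_000.0           # value added per year

def maint_cost(t):
    """Cumulative maintenance cost from new (convex, i.e. accelerating)."""
    return 1_500.0 * t ** 1.8

t = np.linspace(0.1, 15.0, 2000)  # candidate renewal ages, years
net_benefit = revenue_rate * t - maint_cost(t) - capital
profit_rate = net_benefit / t     # slope of the line from the origin

break_even = t[np.argmax(net_benefit > 0)]   # first age with positive profit
renew_at = t[np.argmax(profit_rate)]         # the 'max angle' point of Figure 4

print(f"break-even age       ~ {break_even:.1f} years")
print(f"renew (max angle) at ~ {renew_at:.1f} years, "
      f"profit rate ~ {profit_rate.max():,.0f} per year")
```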


The LCP concept permits maintenance to be seen as an investment rather than an expense.

Every item in a working system has an economic function or it should not be there. Properly calculated AR limits depend upon the ratio of CF to CM rather than upon their absolute values, and include costs which are not paid from the maintenance budget. Properly optimized renewal schedules are cost-effective by definition, but RCM applies the method only to items with a failure-free or very low "failure rate" period at the start of life, and advises the operator to repair or change out the item as soon as this period is over, making no distinction between these policies, or between first and subsequent failures, or between simple and complex items. Using their own argument with respect to large numbers of identical items, all the little savings would add up to a big sum also in the case of many different items. If there are enough data, doing the calculations, even properly, is quite cheap. The justification for collecting the data is that we do not know which items will fail sufficiently often to measure the distributions and calculate schedules until we have operated the system for a while. Intrinsically reliable items produce no data and so present no data collection or storage problems. The principal cost in data collection and storage is in the collection itself.

We have found that maintenance staff are willing to collect data provided that the managers put it to good use. Early schedules are based upon experience and data of uncertain relevance and may have to be changed later. It is actually the critical, frequent and expensive failures which are most likely to warrant the expense of re-design, and the ones of moderate to low cost and moderate frequency which justify preventive maintenance. Safety is best incorporated as a very high failure cost, CF.

The advocates of RCM seem not to understand how to analyze data properly, particularly censored data. A record of no failures of a wearing part with only a moderate cost ratio over many cycles usually indicates that the PM interval is too short. When intervals are shorter than the optimum, more money is wasted than when they are too long by the same amount. RCM's simplistic methods of setting intervals for renewal or inspection generally make the intervals too short. For every maintenance optimization there is an expected residual probability of failure, which can be used to check whether the policy is operating as expected or needs adjustment. The distribution and other estimates do not have to be super-accurate to produce worthwhile savings. Where a physical deterioration mechanism exists, operation to failure is usually not the best policy. The cost ratio will usually be known from the FMECA, and if it is high, any reasonable schedule will be better than none until analysis of operational data gives a better estimate of the distribution.
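The residual-failure check mentioned above is simple to operate. A sketch under illustrative assumptions (a Weibull part under age renewal; all numbers hypothetical): the model predicts what fraction of renewal cycles should still end in failure, and the observed count can be compared with that prediction and its binomial spread before deciding to adjust the interval.

```python
import math

def expected_failure_fraction(tau, beta, eta):
    """P(a Weibull(beta, eta) part fails before the planned renewal at tau)."""
    return 1.0 - math.exp(-(tau / eta) ** beta)

beta, eta, tau = 2.5, 1000.0, 450.0                 # hypothetical part and interval
p = expected_failure_fraction(tau, beta, eta)       # expected residual risk per cycle
cycles, observed_failures = 60, 9                   # hypothetical operating record
expected = p * cycles
spread = 2.0 * math.sqrt(cycles * p * (1.0 - p))    # ~95% binomial spread
print(f"expected failures in {cycles} cycles: {expected:.1f} +/- {spread:.1f}; "
      f"observed: {observed_failures}")
```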

The value of data collection and analysis is not confined to maintenance schedule adjustment. It is much more important that plant manufacturers hear about all the failures, so that they can consider re-design and avoid making the same design errors again. Nowlan & Heap [1] say that manufacturers often refuse to accept responsibility for failures which they consider due to operation beyond design limits, and that collecting such data is therefore not worthwhile. Actually such data are very useful to OEM's because they indicate the real relationship between duty and reliability and how much margin there is in the design. Designers do tend to reject data critical of their designs unless it is so well documented as to be unimpeachable. From another viewpoint, designers do well to note and accommodate the ways that their products are actually used, and design new products that can do what is needed, rather than insist that they should be used in ways which would have lost the sale to a competitor. In RCM's own beloved airline industry, the engine manufacturers pay their customers to report part failures, the condition of parts renewed on age, and condition monitoring readings, because this helps them to improve the product and its maintenance in a very competitive market. Data analysis indicates whether PM is justified as well as how often it should be done, and we cannot be sure about either without them. It is ironic that RCM claims the outstandingly data-conscious aircraft industry as a major success, while sustaining this silly attitude to data collection. RCM's faulty ideas arguably are delaying progress which would be possible with better, integrated IT systems.

6. THE VALUE OF HUMAN LIFE

The existence of regulations and inspectors is grim witness to the fact that some organizations are quite willing to risk human lives in pursuit of profit. There is a price for human life in safety economics and it is sentimental nonsense to deny it, [9]. It is a high one, of course, in a civilized society, but it is nowhere near infinite. It embraces both actual and estimated sums, including compensation, fines, loss of production, loss of reputation and community goodwill, increase in insurance premiums, and internal morale factors, e.g. risk of strike. However an interval for inspection of safety-sensitive equipment is determined, the cost of the accident is implicit; it can be found by inverting the appropriate maintenance model.

Example: Safety valves on boilers are tested and reset 4 times a year at a cost of 120 dollars. The rate of developing faults which would prevent the valves blowing when required is estimated at 0.01 per year, and the demand rate (incidence of over-pressure) is 0.1 per year. The false alarm rate is assumed negligible, and the 120 dollars covers any work done to pass the tests. The cost attributed to a failure (boiler explosion) can be estimated by assuming that the test rate is economically optimal, although it would usually be preferable to calculate the optimum test rate after assigning a cost of failure. The situation can be modeled as in the Markov diagram, Figure 5.

Figure 5. Example of Cost of Human Life (Markov diagram with states S.V. OK and S.V. Failed)


With λ = 0.01/yr the dormant fault rate, q = 4/yr the test rate and f = 0.1/yr the demand rate, the mean cycle time is

$T = 1/\lambda + 1/(q+f) = 1/0.01 + 1/4.1 \approx 100.24$ years   (2)

The cost per year is given by

$C(q) = C_M q + (C_F f + C_M q) / [T(f+q)]$   (3)

Differentiating (3) with respect to q, equating to zero and substituting gives CF = 2,027,435. To find the cost of a human life, subtract the material cost of the failure, i.e. the cost to rebuild the boiler and the loss of revenue plus the costs of fines etc., say 500,000 in all, and divide the remainder by the expected number of fatalities, say 2, giving 763,727 dollars per life. Not all critical failures are fatal or potentially so. Some, like Piper Alpha, are tragic and expensive; others are just expensive.
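A numerical sketch of the inversion just described (Eqs. (2) and (3) as reconstructed above, with the example's figures; the code and variable names are ours): assume the current test rate q is cost-optimal and back out the implied CF, then the implied price of a life. It reproduces the figures quoted above to within a fraction of a percent.

```python
lam, f, q, c_m = 0.01, 0.1, 4.0, 120.0      # fault, demand and test rates; cost per test

def cost_rate(q, c_f):
    """Eq. (3): annual testing cost plus the expected end-of-cycle cost."""
    T = 1.0 / lam + 1.0 / (q + f)           # Eq. (2): mean cycle time, years
    return c_m * q + (c_f * f + c_m * q) / (T * (f + q))

def d_cost_dq(q, c_f, h=1e-6):
    """Numerical derivative of the cost rate with respect to the test rate."""
    return (cost_rate(q + h, c_f) - cost_rate(q - h, c_f)) / (2.0 * h)

# dC/dq is linear in c_f, so two evaluations pin down the optimality condition exactly.
a = d_cost_dq(q, 0.0)
b = d_cost_dq(q, 1.0) - a
c_f = -a / b                                # C_F implied by q being optimal

material_cost, fatalities = 500_000.0, 2
print(f"implied cost of a failure ~ {c_f:,.0f} dollars")
print(f"implied price of a life   ~ {(c_f - material_cost) / fatalities:,.0f} dollars")
```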

7. RCM’S DECISION CHARTS

Figure 6. Simplified Version of RCM Decision Charts. (For each failure mode: if CM is possible & economic then do it; if scheduled repair is possible & economic then do it; if Age or Block Renewal is possible & economic then do it; consider in that order, else no maintenance but maybe redesign. No Loss of Function: inspect to find failures; if that is not possible, then redesign if Safety or the Environment is involved, otherwise no maintenance. Safety & Environment: use a combination of tasks; if this is not feasible then redesign is compulsory.)

The Decision Charts are the Commandments of RCM, just as the Failure Patterns are the Credo. The order of consideration of policies in RCM is fixed, and not necessarily the best that could be done in any individual case, even according to RCM's criterion of reliability rather than economics. Note how, although the failure modes are classified, the decision process is almost the same for all. The basic classification of failures into No Loss of Function (incl. Hidden Faults), Safety & Environmental, Function Loss, and Economic due to Quality or Output Loss is reasonable, except for the failure to understand that there is always an economic loss if the failed item has any purpose, typified by the treatment of redundancy. If there is a standby for a machine in the system, then the failures are classified as No Loss of Function. This begs the question of the need for and prioritizing of redundancy. RCM strikes no economic balance in consideration of redundant items; it simply assumes that no loss of production occurs if there is a full-size standby, and that therefore PM is less likely to be worthwhile. In fact, of course, there is always the small probability that the standby will fail before the repair is completed, and the machine's performance or output quality suffers if PM tasks are abandoned. Under RCM, it is implicit that machines are either operational or not; they may be showing signs of impending failure, but until failure they are fully operational. Yet the (un-simplified) Decision Charts repeatedly ask whether tasks, i.e. maintenance policies such as "scheduled restoration", are "technically feasible and worthwhile". In fact technical feasibility and cost inevitably are connected. For example, for a very high cost relative to the expected period of failure-free operation that might follow, it is technically possible to take a ball bearing out of the machine, take it apart and renew just the fatigued outer race. The high cost relative to the expected benefit is the reason why such an operation is not generally considered feasible. When we then consider the more sensible option of renewing the whole bearing, a "scheduled discard task" in RCM parlance, whether it is "worthwhile" cannot really be decided without the data necessary to find an optimum renewal interval. In contrast, it is worthwhile to repair exchanged computer mother-boards. Which policies will work certainly cannot be discovered without such information; in RCM terms, if you do not know the "Failure Pattern" (Figure 2) you cannot decide the policy. We defy any RCM practitioner to determine the failure pattern, and so the policy, without the necessary data to optimize the interval. Without the costs you cannot decide if it is worthwhile. In other words, one might as well do it properly as badly, and there is no ducking the need for detailed data and data analysis.

8. HOW CAN RCM BE REPAIRED?

The question is really, "What can be salvaged?"

a) Initial schedules should be based upon an FMECA agreed by OEM’s and users. This should remain focused upon functional reliability, but consider also quality of product and system thermodynamic efficiency.

b) Prompt feedback of failure and repair data direct to designers is vital to improve plant and products quickly enough to be useful in modern industrial conditions.

c) The maintenance schedules should be regularly reviewed by Maintenance/Quality Circles as to work content and an optimization group as to frequency.

d) The principle of LCC/P, [10], should inform all decisions, including policy reviews.

e) The decision charts should be modified to require data analysis and economic optimization, including the fusion of routines into blocks and overhauls.

f) The need for detailed data to be collected to inform the FMECAs, policy choices and optimizations must be faced. Savings are available if maintenance is treated as one aspect of an integrated company-wide IT system.

9. SOME FINAL THOUGHTS

a) The entire structure of RCM rests upon the shaky foundation of the faulty analyses of the 1960 data. If they are wrong then so is much else. System ROCOF curves are the result of the maintenance policy and cannot be used directly to set that policy.

b) The proposition that RCM is a fad has been substantiated. There is much over-simplification. The decision charts are "pet rules of action", it has become a "craze", and the failure patterns are "a piece of fancied enlightenment".

c) The modifications required are so extensive that it would not be fair still to call the result RCM. What is needed is better described as Terotechnology, [11]. Even its backers are upgrading RCM by re-defining it, e.g. [12], and this is causing more confusion, but it is not really salvageable.

d) Theoretical flaws always produce bad results sooner or later; for RCM it is likely to be later. The delay whilst policy changes work through ensures RCM is not blamed.

e) The omission of reliability and maintenance from otherwise integrated IT systems in manufacturing companies possibly is connected to the prevalence of unreformed RCM and its careless view of the need for data. Lack of will to collect data is certainly making it difficult to prove the efficacy of modern OR models for maintenance.

REFERENCES

1. Nowlan, F.S. & Heap, H.F., "Reliability-Centered Maintenance", U.S. Dept. of Commerce (NTIS), Springfield, Va., 1978.
2. MIL-STD-2173(AS), "Reliability-centered Maintenance - Requirements for Naval Aircraft, Weapon Systems and Support Equipment", U.S. Dept. of Defense, Washington D.C., 1986.
3. Moubray, J., "Reliability-Centred Maintenance", Butterworth-Heinemann, 1991.
4. Ascher, H. & Feingold, H., "Repairable Systems Reliability: Modeling, Inference, Misconceptions and their Causes", Marcel Dekker, Basel, 1984.
5. Sherwin, D.J. & Lees, F.P., "An investigation of the application of failure data analysis to decision-making in maintenance of process plants", Proceedings of the Institution of Mechanical Engineers, vol. 194, #29, pp. 301-319 (in two parts), London, 1980.
6. Kelly, A., in Davidson, J. (ed.), "The Reliability of Mechanical Systems", I.Mech.E. Guides for the Process Industries, MEP, London, 1988; 2nd edition 1994.
7. Al-Najjar, B., "Improvement in effectiveness of vibration-based condition monitoring system in paper mills", Journal of Engineering Tribology of the I.Mech.E., MEP, 1998 (in press).
8. Sherwin, D.J. & Al-Najjar, B., "Practical models for condition monitoring inspection intervals", Proc. 3rd Int'l Conf. of Maintenance Societies, Adelaide, I.E.Aust., May 1998.
9. Resnikoff, H.L., "Mathematical Aspects of Reliability-Centered Maintenance", Dolby Access Press, Los Altos, California, 1978.
10. Ahlmann, H., "Maintenance effectiveness and economic models in the terotechnology concept", Maintenance Management International, vol. 4, pp. 131-139, 1984.
11. British Standard BS 3811, "Maintenance Management Terms in Terotechnology", BSI, London, 1984.
12. Creecy, M.E. & Agarwal, R., "Maximize reliability through an optimized maintenance program: streamlined reliability-centered maintenance", Proc. 3rd Int'l Conf. of Maintenance Societies, Adelaide, I.E.Aust., May 1998.
13. Cole, G.K., "Practical issues relating to statistical failure analysis of aero gas turbines", Proc. I.Mech.E. Conf. on Mech. Rel'y, MEP, London, 1996.

BIOGRAPHY

David J. Sherwin, MSc, PhD, CEng, MIMechE, MIPlantE
Dept. of Industrial Engineering, Lund University Institute of Technology, PO Box 118, S-221 00 Lund, SWEDEN.

E-mail : [email protected]

David Sherwin was trained in marine engineering by the Royal Navy, in which he served for 19 years. He then took an MSc in Q&R at the University of Birmingham and a PhD in Reliability Applied to Maintenance at Loughborough University of Technology. After two years with Y-ARD Ltd., a marine and off-shore consultancy, as Senior Consultant in Reliability, he returned to Birmingham University, where he taught and researched in Q&R and Maintenance Optimization for 10 years. He was then appointed Professor of Maintenance Engineering at Queensland University of Technology, Brisbane, Australia, and took up his present appointment as Professor of Terotechnology at Lund and Växjö Universities in Sweden in 1993. Dr. Sherwin is a Chartered Engineer (UK), and a member of the Institutions of Mechanical and of Plant Engineers.
