advancing medical equipment maintenance using rcm methodology

1

Malcolm G. Ridgway, Ph.D., CCE

Senior Vice President, Technology Management

Masterplan, Inc., Chatsworth, California

Advancing Medical Equipment Maintenance using RCM Methodology

2

How A Machine Fails: Traditional / Classical Concept

(Pre-1945)

3

First Generation Maintenance

(Pre-1945)

Was – like the machines – relatively simple.

Primary maintenance strategy was “keep it looking sharp” and “Run To Failure”

Primary maintenance tool was an oily rag

4

How A Machine Fails: Second Generation Concept

The “Bath Tub” Curve

5

Second Generation Maintenance

(1945 - 60)

Was – like the machines – a little more complex because the consequences of unreliable machines had become more serious (economically).

Maintenance strategy – Fixed Interval Overhauls

PM was still relatively primitive – more of a craft than a science, and based on the manufacturer’s experience-based (?) recommendations.

6

Third Generation Maintenance

(1960s)

Became – like the machines – considerably more complex. The civil aviation industry became the driver on machine reliability because of the FAA’s concerns for the public safety

1960 - FAA established a Task Force which became known as the Maintenance Steering Group (MSG)

1968 – Landmark document (MSG-1) revolutionized the maintenance business and made the 747 viable

7

How Machines Really Fail: Third Generation Concept

Based on FAA data

8

In the case of aircraft components

Only 6% show a wear-out failure (Type B) pattern

And only 14% have a random failure (Type E) pattern

Whereas

72% show an infant mortality (Type F) characteristic

9

The Famous Moment of Enlightenment

in the 1960s…

...About Scheduled Maintenance

10

More frequent PM can lead to lower reliability !!

11

How This New Approach To Maintenance Made Jumbo Jets Economically Feasible

DC8 – Required the scheduled overhaul of 339 items and 4M man-hours of maintenance prior to its 20,000 hour inspection

DC10 – Required the scheduled overhaul of 7 items and 66K man-hours of maintenance prior to its 20,000 hour inspection

The DC10 is 3X larger, more complex, and 200X more reliable than the DC8

The “event” rate of the DC8 is 60 per million takeoffs;

The “event” rate of the DC10 is 0.3 per million takeoffs.

12

The 1970s

Introduction of the systems approach to maintenance

1974 – DOD contracted with United Airlines to document the maintenance processes being used by the civil aviation industry, and directed that the new approach embodied in the pioneering new concepts be labeled Reliability-Centered Maintenance (RCM).

1978 – Publication of the book “Reliability-Centered Maintenance” by Stanley Nowlan and Howard Heap.

13

Explosive growth of RCM during the 80s & 90s

The military adopts RCM for its ships (including its nuclear submarines) and its aircraft

NASA joins in with its Shuttle Program

The utility industry adopts RCM for many of its power stations, including its nuclear power plants.

1982 – MSG-3 rev 2 Type Certification for the 757/ 767

14

What Exactly Is Reliability-Centered Maintenance?

• Uses processes based on modern reliability analyses

• Considers the entire system: equipment; accessories; user; maintainer; environment; utilities; & the patient

• Focuses on maintaining the device’s function with minimum downtime and acceptable levels of safety

• Uses FMEA to define what can go wrong and why

• Uses precise effectiveness metrics and criteria for whether or not proactive maintenance is cost effective

• If interval-based maintenance is feasible, it provides precise formulas for what the intervals should be

15

Benefits (claimed to result) from using RCM

1. Increased reliability – 50-70% reduction in repairs

2. Increased availability – 25-50% reduction in downtime

3. Greater maintenance cost effectiveness

4. Improved levels of safety

5. Longer useful life of maintained items

6. Creation of comprehensive maintenance databases

16

Current Joint Commission Standards

Standard EC.02.04.01: The hospital manages medical equipment risks

Elements of Performance for EC.02.04.01

3. The hospital identifies the activities, in writing, for maintaining, inspecting, and testing for all medical equipment on the inventory

Note: Hospitals may use different strategies for different items, as appropriate. For example, strategies such as predictive maintenance, reliability-centered maintenance, interval-based inspections, corrective maintenance, or metered maintenance may be selected to ensure reliable performance.

17

Reality Check

• Maintenance (particularly PM) is an issue of declining importance - relative to several other equipment issues (such as use errors and network connectivity)

• But we are still dedicating an estimated 3000 FTEs (costing about $300M /year) to our PM programs

• We could (and should) be doing something more productive and more valuable with these resources !

18

Key PM Issues

1. We still do not have a good consensus on what we mean by the term “PM”, or even why we do it !

2. Although the Joint Commission has allowed us to exclude “non-critical” devices from our PM programs since 1989, we still don’t have a rational definition for a non-critical/ non-life-support device.

3. We don’t have any good methods for justifying the PM intervals that we use.

4. The PM procedures that most of us use could be improved.

19

What Causes Equipment To Fail? (1)

1) Progressive wear or deterioration of a component part

2) Random failure of a component part

3) Poor fabrication or assembly of the hardware

4) Poor design of the system (hardware or processes)

5) Subjecting the device to physical stress outside its design tolerances

6) Exposing the device to environmental stress outside its design tolerances

20

What Causes Equipment To Fail? (2)

7) Incorrect set up or operation of the device by the user

8) The use of a wrong or defective accessory

9) Poor or incomplete initial set-up or installation, or a poor quality previous repair

10) Human interference with the device including (possibly) earlier intrusive PM

Only the first and (possibly) the last of these could be classed as maintenance-related failures

21

Hidden failures

Equipment failures are either likely to be noticed (they are evident, i.e. overt) or they are hidden.

Ideally, devices that are safety-critical or downtime-critical and that have hidden failure modes (i.e. failures that are unlikely to be noticed by the “operating crew”) should be provided with special protection mechanisms.

It is important to subject devices that are safety-critical or downtime-critical and that have hidden failure modes, without reliable special protection mechanisms, to appropriate performance and safety testing.

22

Special Protection Mechanisms

1) Operator warning devices

2) Automatic shut-down devices

3) Automatic relief devices

4) Dual components for functional redundancy

5) Guard mechanisms

Special concern = “multiple failures” = failure modes within the protection mechanisms

23

PM Basics – Why do we do it?

• PM should address:

1. Failures that result from the degradation of the device’s non-durable parts, and

2. Detecting the presence of hidden failures.

• PM cannot and does not prevent all types of equipment failures.

• There are several other, more common, causes of device failure.

• Very important PM issue = hidden failures of any special protection mechanisms

24

What does PM achieve?

• PM prevents some equipment failures and the associated downtime.

• It creates a certain (usually unspecified) level of confidence that the devices tested are safe (because they are not in a hidden failed state).

25

Indirect benefits of PM programs

1. Finding failed or damaged devices that have not been reported as needing to be repaired

2. Periodically confirming that the devices are actually still present in the facility

3. Providing some level of comfort and security that everything possible is being done to maximize the level of equipment safety.

26

What does PM not achieve?

• PM cannot and does not prevent all equipment failures – only those that would have resulted from the degradation of the device’s non-durable parts.

• PM cannot and does not mitigate the most common causes of adverse equipment-related accidents

27

The Bottom Line on PM

• With respect to:

  • reducing the downtime of downtime-critical equipment, and

  • eliminating the most common causes of adverse equipment-related incidents and accidents…

• …even a well implemented PM program provides only a relatively limited value – and it also has a cost

• The more we can optimize the program and quantify the benefits, the easier it will be to balance the value gained from a well-implemented PM program against its cost

28

Better PM terminology

• True preventive maintenance (TPM) = inspecting, cleaning, lubricating, adjusting or replacing the device’s non-durable parts… (aka scheduled restoration, scheduled discard tasks or predictive maintenance - JIT remediation via Condition Monitoring)

• Performance verification and/or safety testing (PVST) = functional testing to detect hidden failures … (aka failure-finding tasks)

29

TPM = True Preventive Maintenance

…is the inspection, cleaning, lubricating, adjustment or replacement of a device’s non-durable parts.

Non-durable parts are those components of the device that have been identified either by the device manufacturer or by general industry experience as needing periodic attention, or being subject to functional deterioration and having a useful lifetime less than that of the complete device.

Examples include filters, batteries, cables, bearings, gaskets, and flexible tubing.

30

Predictive Maintenance…

…involves direct monitoring of some variable that will provide a reliable early warning that a non-durable part is about to fail (aka Condition Monitoring).

An example might be using an oil contaminant sensor in your car’s engine lubricant to turn on a dashboard warning light to tell you when it is time to change your oil.

At the moment this particular PM strategy probably has more potential in the physical plant area than in the biomedical area.

Physical plant examples include: using vibration analysis to warn of bearing wear, and using infrared scanning to detect overheating in electrical switchgear

31

PVST = Performance Verification and Safety Testing

…is functional testing to detect hidden failures.

Examples of hidden failures include: Defibrillators that are delivering significantly less energy than they are set to deliver; heart rate alarms that do not alarm at the set threshold, and protective power cut-offs on hypo-hyperthermia machines that do not operate at the pre-set cut-off temperature.

35

Special features of the ASHE format

• The procedure number as a “universal product code”

• Separation of the TPM and PVST tasks

• Use of the Note box for concise reporting

• User tasks disclaimer

39

Repair Call Cause Coding

40

Repair Call Cause Coding

Cat 1. Are the device and its accessories still working properly and safely? If yes, this is a Category 1 failure (aka: use error; “cannot duplicate”).

Cat 2. Is the device itself OK, with the problem due to use of a wrong or defective accessory or a problem in a connected network? If …

Cat 3. Is the problem due to physical stress? If …

Cat 4. Is there evidence that this problem could be the result of a poor initial installation or an incomplete repair of a previous problem (a “run on”)? If ….

Cat 5. Is there evidence that the failure was due to an out-of-tolerance ambient environmental condition?

41

Repair Call Cause Coding

Cat 6. Is there evidence that the failure is due to a battery problem? If yes, ….

Cat 7. Is there evidence that the failure was due to a lack of preventive maintenance? If yes, ….

Cat 8. Is there evidence that the failure was caused by human interference e.g. earlier intrusive PM? If

Cat 9. Is there any reason to believe that the failure was due to general wear and tear? If yes, ….

Cat 0. The cause of failure is unknown (cannot be categorized).

42

Typical Cause Coding Analysis

Code | Cause of repair call | Call count | %age | Aust.
1 | User-related | 54 | 10.2 | 14%
2 | Accessory or connectivity | 7 | 1.3 | 3%
3 | Physical stress-related | 120 | 22.8 | 25%
4 | Run-on related | 11 | 2.1 | 1%
5 | Environmental stress-related | 13 | 2.5 | 1%
6 | Battery-related | 32 | 6.1 | -
7 | Inadequate PM-related | 17 | 3.2 | 1%
8 | Human interference-related | 0 | 0 | 0
9 | Random, unpredictable failures | 273 | 51.8 | 52%
0 | Uncategorized repair calls | | |
Total | | 527 | 100 | 100%
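The percentage column can be reproduced from the call counts. The following is a minimal Python sketch, assuming the counts in the table above; the names `CALL_COUNTS` and `cause_code_percentages` are illustrative and do not come from the original deck. Code 0 is left out because the table gives no count for it.

```python
# Call counts per cause code, copied from the cause coding table.
# Code 0 (uncategorized) is omitted: the table shows no count for it.
CALL_COUNTS = {
    1: ("User-related", 54),
    2: ("Accessory or connectivity", 7),
    3: ("Physical stress-related", 120),
    4: ("Run-on related", 11),
    5: ("Environmental stress-related", 13),
    6: ("Battery-related", 32),
    7: ("Inadequate PM-related", 17),
    8: ("Human interference-related", 0),
    9: ("Random, unpredictable failures", 273),
}


def cause_code_percentages(counts):
    """Return {code: percentage of all repair calls}, rounded to 0.1."""
    total = sum(n for _, n in counts.values())
    if total == 0:
        raise ValueError("no repair calls recorded")
    return {code: round(100 * n / total, 1) for code, (_, n) in counts.items()}


if __name__ == "__main__":
    shares = cause_code_percentages(CALL_COUNTS)
    for code, (label, n) in CALL_COUNTS.items():
        print(f"{code}  {label:<32} {n:>4}  {shares[code]:>5.1f}%")
```

In this data set, Category 7 (inadequate PM) accounts for 17 of 527 calls, which is the kind of figure the PM effectiveness metrics later in the deck rely on.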

43

Some types of devices will benefit more than others from receiving PM:

(1) Those with non-durable parts

1. Identify all possible PM–preventable failure modes by examining each TPM task listed in the PM procedure

2. Perform a PM Risk Analysis. Rank each failure mode according to the Level of Severity of its potential adverse consequences (LOS score).

3. Estimate the MTBF (Likelihood of Failure, or LOF, score). (How far out is the knee on the Type B Failure Curve?)

4. Multiply the LOS score by the LOF score to determine the device’s PM Risk Score.

44

Classifying the Level of Severity (LOS) of any likely adverse consequences from (1) any non-durable parts-related failures

LOS

4 A PM-preventable failure mode that could be life-threatening or economically “catastrophic” ($$$$)

3 A PM-preventable failure mode that could cause an injury, or have a major impact on patient care or facility economics ($$$)

2 A PM-preventable failure mode that could have some impact on patient care or facility economics ($$)

1 A PM-preventable failure mode that would have only a minor impact on patient care or facility economics ($)

45

Adverse consequences of (overt) equipment failures

Three different kinds of consequences:

1. Adverse safety consequences

• Life-threatening (LOS = 4), safety-major concern (LOS = 3), safety-moderate concern (LOS = 2), safety-only minor concern (LOS = 1)

2. Adverse operational consequences (uptime)

• Uptime-critical (LOS = 4), uptime-major concern (LOS = 3), uptime-moderate concern (LOS = 2), etc.

3. Adverse non-operational consequences (cost of repair)

• Very high cost of repair (LOS = 4), high cost of repair (LOS = 3), moderate cost of repair (LOS = 2), etc.

46

Adverse consequences of (overt) equipment failures

Economic consequences:

• Uptime-critical devices (LOS = 4)

  • Sophisticated imaging devices, such as CT scanners

• Uptime-major concern devices (LOS = 3)

  • Key devices with little or no back-up, such as large central sterilizers and automated lab analyzers

• High and very high cost of repair devices (LOS = 3 and 4)

  • Specialized devices, such as lasers, some sterilizers, some ventilators, etc.

47

Classifying the Likelihood of Failure (LOF) of (1) any non-durable parts

LOF

4 Frequent. Wear-out type failure likely to occur within a one year period (MTBF of up to 1 year)

3 Occasional. Wear-out type failure likely to occur within a one to two year period (MTBF of between 1 and 2 years)

2 Uncommon. Wear-out type failure likely to occur within a two to five year period (MTBF of between 2 and 5 years)

1 Remote. Wear-out type failure not likely to occur within a five year period (MTBF of more than 5 years)
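The LOF bands above can be written as a small lookup. This is a minimal Python sketch, not part of the original deck: the function name `lof_from_mtbf` is hypothetical, and the treatment of the exact boundary values (an MTBF of exactly 1, 2 or 5 years) is an assumption, since the slide does not say which band they fall in.

```python
def lof_from_mtbf(mtbf_years: float) -> int:
    """Map a non-durable part's estimated MTBF (years) to an LOF score.

    Bands follow the slide; boundary values are assigned to the
    more frequent (higher LOF) band, which is an assumption.
    """
    if mtbf_years <= 0:
        raise ValueError("MTBF must be positive")
    if mtbf_years <= 1:
        return 4  # Frequent: wear-out failure likely within one year
    if mtbf_years <= 2:
        return 3  # Occasional: likely within one to two years
    if mtbf_years <= 5:
        return 2  # Uncommon: likely within two to five years
    return 1      # Remote: not likely within five years


if __name__ == "__main__":
    for mtbf in (0.5, 1.5, 3, 10):
        print(mtbf, "years ->", "LOF", lof_from_mtbf(mtbf))
```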

48

RCM Risk Score. Compounding Level of Severity (LOS) and Likelihood of Failure (LOF)

 | LOF = 1 “Remote” | LOF = 2 “Uncommon” | LOF = 3 “Occasional” | LOF = 4 “Frequent”
LOS = 4 | 4 | 8 | 12 | 16
LOS = 3 | 3 | 6 | 9 | 12
LOS = 2 | 2 | 4 | 6 | 8
LOS = 1 | 1 | 2 | 3 | 4

12 – 16 = Critical risk

6 – 9 = “Worth doing”

49

Some types of devices will benefit more than others from receiving PM: (2) Those with hidden failure modes

1. Identify all possible hidden failure modes by examining each PVST task listed in the PM procedure

2. Perform a PM Risk Analysis. Rank each hidden failure mode according to the Level of Severity of its potential adverse consequences (LOS Score).

3. Rank the Likelihood of Failure of each hidden failure (LOF Score) by reviewing data on the “yield” of previous PVST testing (# of HFs/ device-year)

4. Multiply the LOS Score by the LOF Score to determine the device’s PM Risk Score.

50

Classifying the Level of Severity (LOS) of any likely adverse consequences from (2) any hidden failures

LOS

4 A hidden failure mode that could be life-threatening or economically “catastrophic” ($$$$s)

3 A hidden failure mode that could cause an injury or have a major impact on patient care (or $$$s)

2 A hidden failure mode that could have some moderate impact on patient care (or $$s)

1 A hidden failure mode that would have only a minor impact on patient care (or only $)

51

Adverse consequences of hidden equipment failures

Safety consequences:

• Safety-life-threatening devices (LOS = 4)

  • Defibrillator with zero or very low output

• Safety-major impact devices (LOS = 3)

  • Blood warmer with defective over-temp alarm

  • Hypo/hyperthermia machine with defective over-temp alarm or power cut-off mechanism

52

Classifying the Likelihood of Failure (LOF) of (2) any hidden failures

LOF

4 Frequent. “Yield” or hidden failure discovery rate of more than 1 per device-year

3 Occasional. “Yield” or hidden failure discovery rate of 0.5 – 1.0 per device-year

2 Uncommon. “Yield” or hidden failure discovery rate of 0.2 – 0.5 per device-year

1 Remote. “Yield” or hidden failure discovery rate of less than 0.2 per device-year
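For hidden failures the LOF score comes from the PVST “yield”, i.e. hidden failures found per device-year of testing. A minimal Python sketch follows; `lof_from_yield` is a hypothetical name, and assigning the shared boundary values (0.2, 0.5 and 1.0) to a particular band is an assumption, because the slide's ranges overlap at those points.

```python
def lof_from_yield(hidden_failures: int, device_years: float) -> int:
    """Map a PVST discovery rate to an LOF score for hidden failures.

    Yield = hidden failures found / device-years of testing.
    A rate of exactly 0.2 or 0.5 is placed in the higher band and
    exactly 1.0 in the "Occasional" band (an assumption).
    """
    if device_years <= 0:
        raise ValueError("device_years must be positive")
    rate = hidden_failures / device_years
    if rate > 1.0:
        return 4  # Frequent
    if rate >= 0.5:
        return 3  # Occasional
    if rate >= 0.2:
        return 2  # Uncommon
    return 1      # Remote


if __name__ == "__main__":
    # 16 hidden failures over 400 device-years is a yield of 0.04.
    print(lof_from_yield(16, 400))
```

A yield of 16 failures in 400 device-years, the hypothetical data set used later in the deck, falls in the “Remote” band.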

53

RCM Risk Score. Compounding Level of Severity (LOS) and Likelihood of Failure (LOF)

 | LOF = 1 “Remote” | LOF = 2 “Uncommon” | LOF = 3 “Occasional” | LOF = 4 “Frequent”
LOS = 4 | 4 | 8 | 12 | 16
LOS = 3 | 3 | 6 | 9 | 12
LOS = 2 | 2 | 4 | 6 | 8
LOS = 1 | 1 | 2 | 3 | 4

12 – 16 = Critical risk

6 – 9 = “Worth doing”

54

Classifying a device’s PM Priority according to its (worst-case) RCM Risk Score

Risk Score | PM Priority
12 – 16 | 1 “Must-do PM” = (PM–Critical)
6 – 9 | 2 PM judged to be “worth doing”
3 – 4 | 3 PM worth doing – if economics justify (3A) – otherwise (3B) RTF
1 – 2 | 0 Do no PM = “Run to Failure”
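A minimal Python sketch of how the score bands above might be applied. The function names are hypothetical and not from the original deck. The split of Priority 3 into 3A or 3B rests on an economic judgment that the code does not attempt, so it simply returns 3.

```python
def rcm_risk_score(los: int, lof: int) -> int:
    """RCM Risk Score = Level of Severity x Likelihood of Failure."""
    for name, value in (("LOS", los), ("LOF", lof)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be 1, 2, 3 or 4")
    return los * lof


def pm_priority(risk_score: int) -> int:
    """Map a risk score to the PM Priority bands on the slide."""
    if risk_score >= 12:
        return 1  # "Must-do PM" (PM-critical)
    if risk_score >= 6:
        return 2  # PM judged to be "worth doing"
    if risk_score >= 3:
        return 3  # 3A or 3B, depending on economics
    return 0      # Do no PM: run to failure


def device_pm_priority(failure_modes) -> int:
    """Priority from the worst-case score over (LOS, LOF) pairs."""
    worst = max(rcm_risk_score(los, lof) for los, lof in failure_modes)
    return pm_priority(worst)


if __name__ == "__main__":
    # Two illustrative failure modes: scores 8 and 2; worst case is 8.
    print(device_pm_priority([(2, 4), (1, 2)]))
```

Taking the maximum over all failure modes reflects the slide's “worst-case” wording: one high-scoring failure mode is enough to set the device's priority.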

55

Documenting the PM Risk Analysis (1)

Note device type and PM procedure number

For each TPM task statement

• Describe briefly the severity of the consequence if this part degenerates either partially or totally

• Is the LOS a 4, 3, 2 or 1?

• Estimate the time lapse before this degeneration will occur. Is the LOF a 4, 3, 2, or 1?

• What is the combined RCM Risk Score?

• What is the corresponding PM Priority Level?

56

Documenting the PM Risk Analysis (2)

For each PVST task statement

• Describe briefly the hidden failure that this testing will detect and the severity of the consequences

• Is the LOS a 4, 3, 2 or 1?

• Consult database or estimate how often this failure is likely to occur. Is the LOF a 4, 3, 2, or 1?

• What is the combined RCM Risk Score?

• What is the corresponding PM Priority Level?

If worst case is Priority 1, 2 or 3A, which PM strategy will be implemented?

If implementing fixed interval PM, what is the optimum interval?

57

Alternative PM strategies

1. Performing JIT TPM when indicated by direct condition monitoring (aka Predictive Maintenance)

• Optimum approach, but techniques are scarce

2. Using JIT on-board automated or operator-implemented performance and safety testing

• Optimum approach, but no techniques available (yet)

3. Using variable intervals based on usage (metered maintenance)

4. Using fixed intervals (prescriptive or optimized)

• This is the traditional approach, favored by many regulators

5. Allowing the device to Run-to-Failure

• Most cost-effective approach for PM Priority 3B and 0 devices

58

Selecting the most cost-effective PM strategy

If device is PM Priority 3B or 0 – Use RTF

Otherwise – select in the following order

• JIT TPM / JIT PVST (Predictive Maintenance)

• Metered maintenance

• Fixed interval (optimized)

• Fixed interval (prescriptive)
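The selection rule above amounts to a first-match search through an ordered list. A minimal Python sketch, assuming the caller already knows which strategies are technically feasible for the device; the names `PREFERENCE_ORDER` and `select_pm_strategy`, and the use of strings such as "3B" for priorities, are illustrative choices rather than anything defined in the deck.

```python
# Strategies in the order of preference given on the slide.
PREFERENCE_ORDER = [
    "JIT TPM / JIT PVST (predictive maintenance)",
    "Metered maintenance",
    "Fixed interval (optimized)",
    "Fixed interval (prescriptive)",
]

RUN_TO_FAILURE = "Run to failure"


def select_pm_strategy(priority: str, feasible) -> str:
    """Pick a PM strategy for a device.

    priority: one of "1", "2", "3A", "3B", "0".
    feasible: collection of strategy names that are available
              for this device (hypothetical input).
    """
    if priority in ("3B", "0"):
        return RUN_TO_FAILURE
    for strategy in PREFERENCE_ORDER:
        if strategy in feasible:
            return strategy
    raise ValueError("no feasible PM strategy supplied")


if __name__ == "__main__":
    available = {"Metered maintenance", "Fixed interval (prescriptive)"}
    print(select_pm_strategy("2", available))
    print(select_pm_strategy("0", available))
```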

59

Infusion Pump Analysis

1. Using standard FMEA analysis from the classical RCM method, the Thorburn team from The Royal Adelaide Hospital in South Australia identified 145 potential failure modes.

2. But only six were judged to be addressable by some kind of PM task

3. One had a risk score of 8 (PM Priority 2) which the team described as “worth doing”

60

Metrics for Monitoring PM Effectiveness

1. What percentage of repair calls are caused by Category 7 failures (lack of PM) - and what percentage were considered to be in the highest Level of Severity?

2. The frequency of occurrence and level of potential severity of equipment-related patient incidents that were attributable to a hidden failure

61

Determining PM intervals

How we do it now

• Based on the Fennigkoh-Smith EM number (No-no)

• Whatever the manufacturer recommends (?)

• Pursuant to the JC’s July 1, 2001 revision to EC.1.6. (f) and EC.2.10.3. permitting “maintenance strategies” other than the traditional time-based inspection intervals. Text change from “apply professional judgment” to “data-driven decisions” (But which data and how?)

62

Finding Optimum PM Intervals

1) For Predictive (On-Condition) Maintenance - this involves finding a condition monitoring technique with a long P – F (warning) interval

2) For TPM (aka scheduled restoration or scheduled discard tasks) – this requires knowledge of the device’s age-related failure pattern.

3) For PVST functional testing (aka failure-finding tasks) - this requires data on the device’s Mean Time Between Failures (MTBF).

63

Finding the Optimum PM Interval: 2) For TPM (True Preventive Maintenance)

• Requires knowledge of the device’s age-related failure pattern (interval exploration)

• The period between being put into service and the “knee” is called the Economic Life Limit.

• The most efficient interval is just less than 100% of the Economic Life Limit.

64

[Figure: Age-related failure pattern. Failure rate plotted against time; the period from being put into service to the “knee” of the curve is the Economic Life Limit.]

65

Finding the Optimum PM Interval: 3) For PVST (functional testing)

• Requires knowledge of the failure mode’s mean time between failures (MTBF) – from PM testing database

• And what level of confidence (LOC) is desired that the device is in a “safe operating condition” (SOC)?

• These two factors set the maximum testing interval.

66

Hypothetical data from 4 years of PM testing

100 devices were checked annually for 4 years

Hidden failure (e.g. high leakage current) found 16 times

MTBF = 400 (device-years) / 16 = 25 years

From this data we can establish a statistical probability (level of confidence) that, between the tests, one of these devices was actually in a (hidden) failed state

16 devices were in a failed state for (on average) 6 months

Total hidden downtime was therefore 8 device-years

Probability that device is in (hidden) failed state = 8 / 400 = 2%

Probability that device is in safe operating condition = 98%
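The arithmetic on this slide can be checked with a short Python sketch. It assumes, as the slide does, that a hidden failure is on average present for half a test interval before it is found; the function name `hidden_failure_stats` and its parameters are illustrative only.

```python
def hidden_failure_stats(devices: int, years: float, failures_found: int,
                         test_interval_years: float = 1.0):
    """Return (MTBF in years, probability of a hidden failed state).

    Assumes each failure goes undetected for, on average, half of
    one test interval (6 months for annual testing).
    """
    if failures_found <= 0:
        raise ValueError("need at least one failure to estimate MTBF")
    device_years = devices * years
    mtbf_years = device_years / failures_found
    hidden_downtime = failures_found * test_interval_years / 2
    p_failed = hidden_downtime / device_years
    return mtbf_years, p_failed


if __name__ == "__main__":
    mtbf, p_failed = hidden_failure_stats(devices=100, years=4,
                                          failures_found=16)
    print(f"MTBF = {mtbf:.0f} years")
    print(f"P(hidden failed state) = {p_failed:.0%}")
    print(f"P(safe operating condition) = {1 - p_failed:.0%}")
```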

67

According to RCM theory, the relationship between the MTBF, the testing interval (TI), and the probability that the device is in a (hidden) failed state (HFS) is:

HFS (%) = 50 x TI (in years) / MTBF (in years)

And the level of confidence (LOC) that the device is in a safe operating condition is:

LOC (%) = 100 – HFS (%)

As the ratio of the test interval to the MTBF gets smaller, the probability that the device is in a (hidden) failed state also gets smaller.
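The two formulas above, and their rearrangement to give the longest testing interval that still meets a chosen level of confidence, can be sketched in Python as follows. The function names are hypothetical. The rearranged form, TI = (100 – LOC) x MTBF / 50, is simple algebra on the slide's formula and is not stated in the deck itself; the linear relationship is best read as an approximation for testing intervals that are short compared with the MTBF.

```python
def hfs_percent(ti_years: float, mtbf_years: float) -> float:
    """HFS (%) = 50 x TI / MTBF, with both times in years."""
    if ti_years < 0 or mtbf_years <= 0:
        raise ValueError("TI must be >= 0 and MTBF must be > 0")
    return 50.0 * ti_years / mtbf_years


def loc_percent(ti_years: float, mtbf_years: float) -> float:
    """LOC (%) = 100 - HFS (%)."""
    return 100.0 - hfs_percent(ti_years, mtbf_years)


def max_test_interval(mtbf_years: float, target_loc_percent: float) -> float:
    """Longest TI (years) that still achieves the target LOC.

    Obtained by solving LOC = 100 - 50 x TI / MTBF for TI.
    """
    if not 0 < target_loc_percent < 100:
        raise ValueError("target LOC must be between 0 and 100")
    return (100.0 - target_loc_percent) * mtbf_years / 50.0


if __name__ == "__main__":
    print(hfs_percent(1, 25))          # annual testing, MTBF of 25 years
    print(loc_percent(0.5, 100))       # six-monthly testing, MTBF of 100 years
    print(max_test_interval(50, 99))   # interval for 99% LOC, MTBF of 50 years
```

For an MTBF of 50 years and a desired LOC of 99%, the rearranged formula gives a maximum testing interval of 1 year, which agrees with the formula applied in the forward direction.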

68

The ratio of the test interval (TI) to the MTBF determines the Level of Confidence (LOC) that the device is in a Safe Operating Condition (i.e. not in a HFS)

TI (yrs) | MTBF (yrs) | HFS (%) | LOC/SOC (%)
0.5 | 25 | 1% | 99%
0.5 | 50 | 0.5% | 99.5%
0.5 | 100 | 0.25% | 99.75%
1 | 25 | 2% | 98%
1 | 50 | 1% | 99%
1 | 100 | 0.5% | 99.5%
2 | 50 | 2% | 98%
2 | 100 | 1% | 99%
4 | 100 | 2% | 98%
4 | 200 | 1% | 99%

69

[Figure: Relationship between the LOC (that the device is not in a HFS), the Testing Interval (TI) and the MTBF. HFS (1%, 2%) and LOC (99%, 98%) plotted against the testing interval (1 to 4 years), with one line each for MTBF = 25, 50, 100 and 200 years.]

70

Manufacturer-recommended maintenance intervals

Legal question: “Did you follow the manufacturer’s maintenance recommendations?”

Selection of the optimum interval requires knowledge of the non-durable part’s (NDP’s) age-related failure pattern

Extensive (pre-market) testing in a simulated environment is time consuming and costly. Therefore it is highly likely that the manufacturer’s recommendations are based more on “guesstimates” than on actual testing.

71

Questions ?
