advancing medical equipment maintenance using rcm methodology
1
Malcolm G. Ridgway, Ph.D., CCE
Senior Vice President, Technology Management
Masterplan, Inc., Chatsworth, California
Advancing Medical Equipment Maintenance
using RCM Methodology
2
How A Machine Fails
Traditional / Classical Concept
(Pre-1945)
3
First Generation Maintenance
(Pre-1945)
Was – like the machines – relatively simple.
Primary maintenance strategy was “keep it looking sharp” and “Run To Failure”
Primary maintenance tool was an oily rag
4
How A Machine Fails
Second Generation Concept
The “Bath Tub” Curve
5
Second Generation Maintenance
(1945 - 60)
Was – like the machines – a little more complex because the consequences of unreliable machines had become more serious (economically).
Maintenance strategy – Fixed Interval Overhauls
PM was still relatively primitive – more of a craft than a science, and based on the manufacturer’s experience-based (?) recommendations.
6
Third Generation Maintenance
(1960s)
Became – like the machines – considerably more complex. The civil aviation industry became the driver on machine reliability because of the FAA’s concerns for the public safety
1960 - FAA established a Task Force which became known as the Maintenance Steering Group (MSG)
1968 – Landmark document (MSG-1) revolutionized the maintenance business and made the 747 viable
7
How Machines Really Fail
Third Generation Concept
Based on FAA data
8
In the case of aircraft components
Only 6% show a wear-out failure (Type B) pattern
And only 14% have a random failure (Type E) pattern
Whereas
72% show an infant mortality (Type F) characteristic
9
The Famous Moment of Enlightenment
in the 1960s…
...About Scheduled Maintenance
10
More frequent PM can lead to lower reliability !!
11
How This New Approach To Maintenance Made Jumbo Jets Economically Feasible
DC8 – Required the scheduled overhaul of 339 items and 4M man-hours of maintenance prior to its 20,000 hour inspection
DC10 – Required the scheduled overhaul of 7 items and 66K man-hours of maintenance prior to its 20,000 hour inspection
The DC10 is 3X larger, more complex, and 200X more reliable than the DC8
The “event” rate of the DC8 is 60 per million takeoffs;
The “event” rate of the DC10 is 0.3 per million takeoffs.
12
The 1970s
Introduction of the systems approach to maintenance
1974 – DOD contracted with United Airlines to document the maintenance processes being used by the civil aviation industry, and directed that the new approach embodied in the pioneering new concepts be labeled Reliability-Centered Maintenance (RCM).
1978 – Publication of the book “Reliability-Centered Maintenance” by Stanley Nowlan and Howard Heap.
13
Explosive growth of RCM during the 80s & 90s
The military adopts RCM for its ships (including its nuclear submarines) and its aircraft
NASA joins in with its Shuttle Program
The utility industry adopts RCM for many of its power stations, including its nuclear power plants.
1982 – MSG-3 rev 2 Type Certification for the 757/ 767
14
What Exactly Is Reliability-Centered Maintenance?
• Uses processes based on modern reliability analyses
• Considers the entire system: equipment; accessories; user; maintainer; environment; utilities; & the patient
• Focuses on maintaining the device’s function with minimum downtime and acceptable levels of safety
• Uses FMEA to define what can go wrong and why
• Uses precise effectiveness metrics and criteria for whether or not proactive maintenance is cost effective
• If interval-based maintenance is feasible, it provides precise formulas for what the intervals should be
15
Benefits (claimed to result)
from using RCM
1. Increased reliability – 50-70% reduction in repairs
2. Increased availability – 25-50% reduction in downtime
3. Greater maintenance cost effectiveness
4. Improved levels of safety
5. Longer useful life of maintained items
6. Creation of comprehensive maintenance databases
16
Current Joint Commission Standards
Standard EC.02.04.01
The hospital manages medical equipment risks
Elements of Performance for EC.02.04.01
3. The hospital identifies the activities, in writing, for maintaining, inspecting, and testing for all medical equipment on the inventory
Note: Hospitals may use different strategies for different items, as appropriate. For example, strategies such as predictive maintenance, reliability-centered maintenance, interval-based inspections, corrective maintenance, or metered maintenance may be selected to ensure reliable performance.
17
Reality Check
• Maintenance (particularly PM) is an issue of declining importance - relative to several other equipment issues (such as use errors and network connectivity)
• But we are still dedicating an estimated 3000 FTEs (costing about $300M /year) to our PM programs
• We could (and should) be doing something more productive and more valuable with these resources !
18
Key PM Issues
1. We still do not have a good consensus on what we mean by the term “PM”, or even why we do it !
2. Although the Joint Commission has allowed us to exclude “non-critical” devices from our PM programs since 1989, we still don’t have a rational definition for a non-critical/ non-life-support device.
3. We don’t have any good methods for justifying the PM intervals that we use.
4. The PM procedures that most of us use could be improved.
19
What Causes Equipment To Fail? (1)
1) Progressive wear or deterioration of a component part
2) Random failure of a component part
3) Poor fabrication or assembly of the hardware
4) Poor design of the system (hardware or processes)
5) Subjecting the device to physical stress outside its design tolerances
6) Exposing the device to environmental stress outside its design tolerances
20
What Causes Equipment To Fail? (2)
7) Incorrect set up or operation of the device by the user
8) The use of a wrong or defective accessory
9) Poor or incomplete initial set-up or installation, or a poor quality previous repair
10) Human interference with the device including (possibly) earlier intrusive PM
Only the first and (possibly) the last of these could be classed as maintenance-related failures
21
Hidden failures
Equipment failures are either likely to be noticed (they are evident, i.e. overt) or they are hidden.
Ideally, devices that are safety-critical or downtime-critical and that have hidden failure modes, i.e. failures that are unlikely to be noticed by the “operating crew”, should be provided with special protection mechanisms.
It is important to subject devices that are safety-critical or downtime-critical and that have hidden failure modes, without reliable special protection mechanisms, to appropriate performance and safety testing.
22
Special Protection Mechanisms
1) Operator warning devices
2) Automatic shut-down devices
3) Automatic relief devices
4) Dual components for functional redundancy
5) Guard mechanisms
Special concern = “multiple failures” = failure modes within the protection mechanisms
23
PM Basics – Why do we do it?
• PM should address:
1. Failures that result from the degradation of the device’s non-durable parts, and
2. Detecting the presence of hidden failures.
• PM cannot and does not prevent all types of equipment failures.
• There are several other, more common, causes of device failure.
• Very important PM issue = hidden failures of any special protection mechanisms
24
What does PM achieve?
• PM prevents some equipment failures and the associated downtime.
• It creates a certain (usually unspecified) level of confidence that the devices tested are safe (because they are not in a hidden failed state).
25
Indirect benefits of PM programs
1. Finding failed or damaged devices that have not been reported as needing to be repaired
2. Periodically confirming that the devices are actually still present in the facility
3. Providing some level of comfort and security that everything possible is being done to maximize the level of equipment safety.
26
What does PM not achieve?
• PM cannot and does not prevent all equipment failures – only those that would have resulted from the degradation of the device’s non-durable parts.
• PM cannot and does not mitigate the most common causes of adverse equipment-related accidents
27
The Bottom Line on PM
• With respect to:
  • reducing the downtime of downtime-critical equipment, and
  • eliminating the most common causes of adverse equipment-related incidents and accidents…
• …even a well implemented PM program provides only a relatively limited value – and it also has a cost
• The more we can optimize the program and quantify the benefits, the easier it will be to balance the value gained from a well-implemented PM program against its cost
28
Better PM terminology
• True preventive maintenance (TPM) = inspecting, cleaning, lubricating, adjusting or replacing the device’s non-durable parts… (aka scheduled restoration, scheduled discard tasks or predictive maintenance - JIT remediation via Condition Monitoring)
• Performance verification and/or safety testing (PVST) = functional testing to detect hidden failures … (aka failure-finding tasks)
29
TPM = True Preventive Maintenance
…is the inspection, cleaning, lubricating, adjustment or replacement of a device’s non-durable parts.
Non-durable parts are those components of the device that have been identified either by the device manufacturer or by general industry experience as needing periodic attention, or being subject to functional deterioration and having a useful lifetime less than that of the complete device.
Examples include filters, batteries, cables, bearings, gaskets, and flexible tubing.
30
Predictive Maintenance…
…involves direct monitoring of some variable that will provide a reliable early warning that a non-durable part is about to fail (aka Condition Monitoring).
An example might be using an oil contaminant sensor in your car’s engine lubricant to turn on a dashboard warning light to tell you when it is time to change your oil.
At the moment this particular PM strategy probably has more potential in the physical plant area than in the biomedical area.
Physical plant examples include: using vibration analysis to warn of bearing wear, and using infrared scanning to detect overheating in electrical switchgear
31
PVST = Performance Verification and Safety Testing
…is functional testing to detect hidden failures.
Examples of hidden failures include: Defibrillators that are delivering significantly less energy than they are set to deliver; heart rate alarms that do not alarm at the set threshold, and protective power cut-offs on hypo-hyperthermia machines that do not operate at the pre-set cut-off temperature.
32
33
34
35
Special features of the ASHE format
• The procedure number as a “universal product code”
• Separation of the TPM and PVST tasks
• Use of the Note box for concise reporting
• User tasks disclaimer
36
37
38
39
Repair Call Cause Coding
40
Repair Call Cause Coding
Cat 1. Are the device and its accessories still working properly and safely? If yes, this is a Category 1 failure (aka: use error; “cannot duplicate”).
Cat 2. Is the device itself OK, with the problem due to use of a wrong or defective accessory or a problem in a connected network? If …
Cat 3. Is the problem due to physical stress? If …
Cat 4. Is there evidence that this problem could be the result of a poor initial installation or an incomplete repair of a previous problem (a “run on”)? If …
Cat 5. Is there evidence that the failure was due to an out-of-tolerance ambient environmental condition?
41
Repair Call Cause Coding
Cat 6. Is there evidence that the failure is due to a battery problem? If yes, ….
Cat 7. Is there evidence that the failure was due to a lack of preventive maintenance? If yes, ….
Cat 8. Is there evidence that the failure was caused by human interference, e.g. earlier intrusive PM? If …
Cat 9. Is there any reason to believe that the failure was due to general wear and tear? If yes, ….
Cat 0. The cause of failure is unknown (cannot be categorized).
42
Typical Cause Coding Analysis
Code  Cause of repair call            Call count  %age   Aust.
1     User-related                    54          10.2   14%
2     Accessory or connectivity       7           1.3    3%
3     Physical stress-related         120         22.8   25%
4     Run-on related                  11          2.1    1%
5     Environmental stress-related    13          2.5    1%
6     Battery-related                 32          6.1    -
7     Inadequate PM-related           17          3.2    1%
8     Human interference-related      0           0      0
9     Random, unpredictable failures  273         51.8   52%
0     Uncategorized repair calls
      Total                           527         100    100%
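The percentage column can be re-derived from the call counts. The sketch below is only an illustration using the counts from the table above; the names `call_counts` and `cause_percentages` are hypothetical, and code 0 is left out because the table gives no count for it.

```python
# Call counts per cause code, copied from the table above.
# Code 0 (uncategorized) is omitted because the table lists no count for it.
call_counts = {
    1: ("User-related", 54),
    2: ("Accessory or connectivity", 7),
    3: ("Physical stress-related", 120),
    4: ("Run-on related", 11),
    5: ("Environmental stress-related", 13),
    6: ("Battery-related", 32),
    7: ("Inadequate PM-related", 17),
    8: ("Human interference-related", 0),
    9: ("Random, unpredictable failures", 273),
}


def cause_percentages(counts):
    """Return each cause code's share of all repair calls, in percent (1 decimal)."""
    total = sum(n for _, n in counts.values())
    return {code: round(100 * n / total, 1) for code, (_, n) in counts.items()}


total_calls = sum(n for _, n in call_counts.values())
shares = cause_percentages(call_counts)

for code, (label, n) in call_counts.items():
    print(f"{code}  {label:<32} {n:>4} {shares[code]:>6.1f}%")
print(f"   {'Total':<32} {total_calls:>4}")
```

On this data, the share of repair calls attributed to inadequate PM (code 7) is 17 of 527, or about 3.2%.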
43
Some types of devices will benefit more than others from receiving PM:
(1) Those with non-durable parts
1. Identify all possible PM–preventable failure modes by examining each TPM task listed in the PM procedure
2. Perform a PM Risk Analysis. Rank each failure mode according to the Level of Severity of its potential adverse consequences (LOS score).
3. Estimate the MTBF to obtain the Likelihood of Failure (LOF) score, i.e. how far out the knee is on the Type B failure curve.
4. Multiply the LOS score by the LOF score to determine the device’s PM Risk Score.
44
Classifying the Level of Severity (LOS) of any likely adverse consequences from
(1) any non-durable parts-related failures
LOS
4 A PM-preventable failure mode that could be life-threatening or economically “catastrophic” ($$$$)
3 A PM-preventable failure mode that could cause an injury, or have a major impact on patient care or facility economics ($$$)
2 A PM-preventable failure mode that could have some impact on patient care, or facility economics ($$)
1 A PM-preventable failure mode that would have only a minor impact on patient care, or facility economics ($)
45
Adverse consequences of (overt) equipment failures
Three different kinds of consequences:
1. Adverse safety consequences
• Life-threatening (LOS = 4), safety-major concern (LOS = 3), safety-moderate concern (LOS = 2), safety-only minor concern
2. Adverse operational consequences (uptime)
• Uptime-critical (LOS = 4), uptime-major concern (LOS = 3), uptime-moderate concern (LOS = 2), etc.
3. Adverse non-operational consequences (cost of repair)
• Very high cost of repair (LOS = 4), high cost of repair (LOS = 3), moderate cost of repair (LOS = 2), etc.
46
Adverse consequences of (overt) equipment failures
Economic consequences:
• Uptime-critical devices (LOS = 4)
  • Sophisticated imaging devices, such as CT scanners
• Uptime-major concern devices (LOS = 3)
  • Key devices with little or no back-up, such as large central sterilizers and automated lab analyzers
• High and very high cost of repair devices (LOS = 3 and 4)
  • Specialized devices, such as lasers, some sterilizers, some ventilators, etc.
47
Classifying the Likelihood of Failure (LOF)
of (1) any non-durable parts
LOF
4 Frequent. Wear-out type failure likely to occur within a one year period (MTBF of up to 1 year)
3 Occasional. Wear-out type failure likely to occur within a one to two year period (MTBF of between 1 and 2 years)
2 Uncommon. Wear-out type failure likely to occur within a two to five year period (MTBF of between 2 and 5 years)
1 Remote. Wear-out type failure not likely to occur within a five year period (MTBF of more than 5 years)
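The LOF bands above can be written as a small lookup. This is a minimal sketch, not part of the original material: the function name `lof_from_mtbf` is hypothetical, and because the slide does not say which band an MTBF of exactly 1, 2 or 5 years falls into, the code assumes each boundary value belongs to the more frequent (higher-scoring) band.

```python
def lof_from_mtbf(mtbf_years: float) -> int:
    """Map a wear-out MTBF (in years) to the LOF score for non-durable parts.

    Assumption: boundary values (exactly 1, 2 or 5 years) are assigned to the
    more frequent band, since the source table does not specify them.
    """
    if mtbf_years <= 0:
        raise ValueError("MTBF must be positive")
    if mtbf_years <= 1:
        return 4  # Frequent
    if mtbf_years <= 2:
        return 3  # Occasional
    if mtbf_years <= 5:
        return 2  # Uncommon
    return 1      # Remote


# Illustrative MTBF values only; these are not taken from any real device.
for years in (0.5, 1.5, 3, 10):
    print(years, "years ->", "LOF", lof_from_mtbf(years))
```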
48
RCM Risk Score. Compounding Level of Severity (LOS) and Likelihood of Failure (LOF)
           LOF = 1     LOF = 2       LOF = 3        LOF = 4
           “Remote”    “Uncommon”    “Occasional”   “Frequent”
LOS = 4    4           8             12             16
LOS = 3    3           6             9              12
LOS = 2    2           4             6              8
LOS = 1    1           2             3              4
12 – 16 = Critical risk
6 – 9 = “Worth doing”
49
Some types of devices will benefit more than others from receiving PM: (2) Those with hidden failure modes
1. Identify all possible hidden failure modes by examining each PVST task listed in the PM procedure
2. Perform a PM Risk Analysis. Rank each hidden failure mode according to the Level of Severity of its potential adverse consequences (LOS Score).
3. Rank the Likelihood of Failure of each hidden failure (LOF Score) by reviewing data on the “yield” of previous PVST testing (# of HFs/ device-year)
4. Multiply the LOS Score by the LOF Score to determine the device’s PM Risk Score.
50
Classifying the Level of Severity (LOS) of any likely adverse consequences from
(2) any hidden failures
LOS
4 A hidden failure mode that could be life-threatening or economically “catastrophic” ($$$$s)
3 A hidden failure mode that could cause an injury or have a major impact on patient care (or $$$s)
2 A hidden failure mode that could have some moderate impact on patient care (or $$s)
1 A hidden failure mode that would have only a minor impact on patient care (or only $)
51
Adverse consequences of hidden equipment failures
Safety consequences:
• Safety-life-threatening devices (LOS = 4)
  • Defibrillator with zero or very low output
• Safety-major impact devices (LOS = 3)
  • Blood warmer with defective over-temp alarm
  • Hypo/hyperthermia with defective over-temp alarm or power cut-off mechanism
52
Classifying the Likelihood of Failure (LOF) of
(2) any hidden failures
LOF
4 Frequent. “Yield” or hidden failure discovery rate of more than 1 per device-year
3 Occasional. “Yield” or hidden failure discovery rate of 0.5 – 1.0 per device-year
2 Uncommon. “Yield” or hidden failure discovery rate of 0.2 – 0.5 per device-year
1 Remote. “Yield” or hidden failure discovery rate of less than 0.2 per device-year
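The yield-based bands can be sketched the same way. The function names below are hypothetical and the code is only an illustration: the table lists 0.5 in two bands, so the sketch assumes that a yield of exactly 0.5 scores 3 and a yield of exactly 0.2 scores 2.

```python
def pvst_yield(hidden_failures_found: int, device_years: float) -> float:
    """Hidden failures discovered per device-year of PVST testing."""
    if device_years <= 0:
        raise ValueError("device_years must be positive")
    return hidden_failures_found / device_years


def lof_from_yield(yield_per_device_year: float) -> int:
    """Map a hidden failure discovery rate to the LOF score.

    Assumption: overlapping boundary values (0.5 and 0.2) go to the
    higher-scoring band, since the source table does not specify them.
    """
    if yield_per_device_year < 0:
        raise ValueError("yield cannot be negative")
    if yield_per_device_year > 1.0:
        return 4  # Frequent
    if yield_per_device_year >= 0.5:
        return 3  # Occasional
    if yield_per_device_year >= 0.2:
        return 2  # Uncommon
    return 1      # Remote


# Example: 16 hidden failures found over 400 device-years of testing.
y = pvst_yield(16, 400)
print(f"yield = {y:.2f} per device-year -> LOF {lof_from_yield(y)}")
```

With 16 hidden failures found over 400 device-years the yield is 0.04 per device-year, which falls in the “Remote” band.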
53
RCM Risk Score. Compounding Level of Severity (LOS) and Likelihood of Failure (LOF)
           LOF = 1     LOF = 2       LOF = 3        LOF = 4
           “Remote”    “Uncommon”    “Occasional”   “Frequent”
LOS = 4    4           8             12             16
LOS = 3    3           6             9              12
LOS = 2    2           4             6              8
LOS = 1    1           2             3              4
12 – 16 = Critical risk
6 – 9 = “Worth doing”
54
Classifying a device’s PM Priority according to its (worst-case) RCM Risk Score
Risk Score   PM Priority
12 – 16      1   “Must-do PM” = (PM-Critical)
6 – 9        2   PM judged to be “worth doing”
3 – 4        3   PM worth doing – if economics justify (3A) – otherwise (3B) RTF
1 – 2        0   Do no PM = “Run to Failure”
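The scoring and banding can be sketched in a few lines. The function names are hypothetical and this is an illustration rather than a definitive implementation; it relies on the fact that the product of two scores in the range 1 to 4 can only be 1, 2, 3, 4, 6, 8, 9, 12 or 16, so every possible score falls into one of the four bands. Priority 3 is returned unsplit because the 3A/3B choice depends on an economic judgment that the score alone does not capture.

```python
def rcm_risk_score(los: int, lof: int) -> int:
    """RCM Risk Score = Level of Severity x Likelihood of Failure."""
    for name, value in (("LOS", los), ("LOF", lof)):
        if value not in (1, 2, 3, 4):
            raise ValueError(f"{name} must be 1, 2, 3 or 4")
    return los * lof


def pm_priority(risk_score: int) -> str:
    """Map a (worst-case) risk score to its PM Priority band."""
    if risk_score >= 12:
        return "1"  # Must-do PM (PM-Critical)
    if risk_score >= 6:
        return "2"  # PM judged to be worth doing
    if risk_score >= 3:
        return "3"  # 3A if economics justify PM, otherwise 3B (run to failure)
    return "0"      # Do no PM (run to failure)


def worst_case_priority(los_lof_pairs) -> str:
    """Priority for a device, taken from its highest-scoring failure mode."""
    return pm_priority(max(rcm_risk_score(s, f) for s, f in los_lof_pairs))


# Illustrative scores only: three failure modes with (LOS, LOF) pairs.
print(worst_case_priority([(4, 2), (2, 1), (1, 3)]))  # worst score is 8
```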
55
Documenting the PM Risk Analysis (1)
Note device type and PM procedure number
For each TPM task statement
• Describe briefly the severity of the consequence if this part degenerates either partially or totally
• Is the LOS a 4, 3, 2 or 1?
• Estimate the time lapse before this degeneration will occur. Is the LOF a 4, 3, 2 or 1?
• What is the combined RCM Risk Score?
• What is the corresponding PM Priority Level?
56
Documenting the PM Risk Analysis (2)
For each PVST task statement
• Describe briefly the hidden failure that this testing will detect and the severity of the consequences
• Is the LOS a 4, 3, 2 or 1?
• Consult database or estimate how often this failure is likely to occur. Is the LOF a 4, 3, 2 or 1?
• What is the combined RCM Risk Score?
• What is the corresponding PM Priority Level?
If worst case is Priority 1,2 or 3A, which PM strategy will be implemented?
If implementing fixed interval PM, what is the optimum?
57
Alternative PM strategies
1. Performing JIT TPM when indicated by direct condition monitoring (aka Predictive Maintenance)
• Optimum approach, but techniques are scarce
2. Using JIT on-board automated or operator-implemented performance and safety testing
• Optimum approach, but no techniques available (yet)
3. Using variable intervals based on usage (metered maintenance)
4. Using fixed intervals (prescriptive or optimized)
• This is the traditional approach, favored by many regulators
5. Allowing the device to Run-to-Failure
• Most cost-effective approach for PM Priority 3B and 0 devices
58
Selecting the most cost-effective PM strategy
If device is PM Priority 3B or 0 – Use RTF
Otherwise – select in the following order
• JIT TPM / JIT PVST (Predictive Maintenance)
• Metered maintenance
• Fixed interval (optimized)
• Fixed interval (prescriptive)
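The selection order above can be expressed as a simple fall-through. This is a sketch under stated assumptions: the function name and the three availability flags are hypothetical, introduced only to represent whether each option is practical for a given device; the slide itself does not define how that is decided.

```python
def select_pm_strategy(
    pm_priority: str,
    jit_available: bool = False,             # assumption: a condition-monitoring or self-test technique exists
    metered_available: bool = False,         # assumption: the device records usage (hours, cycles)
    optimized_interval_known: bool = False,  # assumption: failure data supports an optimized interval
) -> str:
    """Pick the first feasible strategy in the order given on the slide."""
    if pm_priority not in {"1", "2", "3A", "3B", "0"}:
        raise ValueError("pm_priority must be one of 1, 2, 3A, 3B, 0")
    if pm_priority in {"3B", "0"}:
        return "Run-to-Failure"
    if jit_available:
        return "JIT TPM / JIT PVST (Predictive Maintenance)"
    if metered_available:
        return "Metered maintenance"
    if optimized_interval_known:
        return "Fixed interval (optimized)"
    return "Fixed interval (prescriptive)"


print(select_pm_strategy("0"))
print(select_pm_strategy("2", optimized_interval_known=True))
```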
59
Infusion Pump Analysis
1. Using standard FMEA analysis from the classical RCM method, the Thorburn team from The Royal Adelaide Hospital in South Australia identified 145 potential failure modes.
2. But only six were judged to be addressable by some kind of PM task
3. One had a risk score of 8 (PM Priority 2) which the team described as “worth doing”
60
Metrics for Monitoring PM Effectiveness
1. What percentage of repair calls are caused by Category 7 failures (lack of PM) - and what percentage were considered to be in the highest Level of Severity?
2. The frequency of occurrence and level of potential severity of equipment-related patient incidents that were attributable to a hidden failure
61
Determining PM intervals
How we do it now
• Based on the Fennigkoh-Smith EM number (No-no)
• Whatever the manufacturer recommends (?)
• Pursuant to the JC’s July 1, 2001 revision to EC.1.6. (f) and EC.2.10.3. permitting “maintenance strategies” other than the traditional time-based inspection intervals. Text change from “apply professional judgment” to “data-driven decisions” (But which data and how?)
62
Finding Optimum PM Intervals
1) For Predictive (On-Condition) Maintenance - this involves finding a condition monitoring technique with a long P – F (warning) interval
2) For TPM (aka scheduled restoration or scheduled discard tasks) – this requires knowledge of the device’s age-related failure pattern.
3) For PVST functional testing (aka failure-finding tasks) - this requires data on the device’s Mean Time Between Failures (MTBF).
63
Finding the Optimum PM Interval
2) For TPM (True Preventive Maintenance)
• Requires knowledge of the device’s age-related failure pattern (interval exploration)
• The period between being put into service and the “knee” is called the Economic Life Limit.
• The most efficient interval is just less than 100% of the Economic Life Limit.
64
• The period between being put into service and the “knee” is called the Economic Life Limit.
• Most efficient interval is just less than 100% of the economic life limit.
[Figure: Age-related failure pattern (failure rate vs. time)]
65
Finding the Optimum PM Interval
3) For PVST (functional testing)
• Requires knowledge of the failure mode’s mean time between failures (MTBF) – from PM testing database
• And what level of confidence (LOC) is desired that the device is in a “safe operating condition” (SOC)?
• These two factors set the maximum testing interval.
66
Hypothetical data from 4 years of PM testing
• 100 devices were checked annually for 4 years
• Hidden failure (e.g. high leakage current) found 16 times
• MTBF = 400 (device-years) / 16 = 25 years
From this data we can establish a statistical probability (level of confidence) that, between the tests, one of these devices was actually in a (hidden) failed state
• 16 devices were in a failed state for (on average) 6 months
• Total hidden downtime was therefore 8 device-years
• Probability that device is in (hidden) failed state = 8 / 400 = 2%
• Probability that device is in safe operating condition = 98%
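The arithmetic in this hypothetical example can be reproduced step by step. The variable names are illustrative; the one modelling assumption, taken from the example itself, is that a hidden failure goes undetected for half the testing interval on average.

```python
import math

# Hypothetical data from the example above.
devices = 100
years_of_testing = 4
test_interval_years = 1.0      # devices were checked annually
hidden_failures_found = 16

device_years = devices * years_of_testing             # 400 device-years
mtbf_years = device_years / hidden_failures_found     # 25 years

# Assumption (from the example): each failure is hidden for, on average,
# half the testing interval before the next test finds it.
avg_hidden_time_years = test_interval_years / 2       # 6 months
hidden_downtime = hidden_failures_found * avg_hidden_time_years  # 8 device-years

p_hidden_failed = hidden_downtime / device_years      # 0.02
p_safe = 1 - p_hidden_failed                          # 0.98

print(f"MTBF = {mtbf_years:.0f} years")
print(f"P(hidden failed state) = {p_hidden_failed:.0%}")
print(f"P(safe operating condition) = {p_safe:.0%}")
```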
67
According to RCM theory, the relationship between the MTBF, the testing interval (TI), and the probability that the device is in a (hidden) failed state (HFS) is:
HFS (%) = 50 x TI (in years) / MTBF (in years)
And the level of confidence (LOC) that the device is in a safe operating condition is:
LOC (%) = 100 – HFS (%)
As the ratio of the test interval to the MTBF gets smaller, the probability that the device is in a (hidden) failed state also gets smaller.
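One way to see where the factor of 50 comes from is sketched below. It assumes that a hidden failure is equally likely to occur at any point within the testing interval, so that it stays undetected for TI/2 on average, and that TI is much shorter than the MTBF. Both are simplifying assumptions made for this illustration.

```latex
\[
\mathrm{HFS}\,(\%) \;\approx\; 100 \times \frac{\mathrm{TI}/2}{\mathrm{MTBF}}
\;=\; 50 \times \frac{\mathrm{TI}}{\mathrm{MTBF}},
\qquad
\mathrm{LOC}\,(\%) \;=\; 100 - \mathrm{HFS}\,(\%)
\]
\[
\text{Example: } \mathrm{TI} = 1\ \text{yr},\ \mathrm{MTBF} = 25\ \text{yr}
\;\Rightarrow\;
\mathrm{HFS} = 50 \times \tfrac{1}{25} = 2\%,\quad \mathrm{LOC} = 98\%
\]
```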
68
The ratio of the test interval (TI) to the MTBF determines the Level of Confidence (LOC) that the device is in a Safe Operating Condition (i.e. not in a HFS)
TI (yrs)   MTBF (yrs)   HFS (%)   LOC/SOC (%)
0.5        25           1%        99%
0.5        50           0.5%      99.5%
0.5        100          0.25%     99.75%
1          25           2%        98%
1          50           1%        99%
1          100          0.5%      99.5%
2          50           2%        98%
2          100          1%        99%
4          100          2%        98%
4          200          1%        99%
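The table rows follow directly from HFS (%) = 50 x TI / MTBF, and the same relationship can be rearranged to give the longest testing interval that still meets a desired level of confidence. The function names below are hypothetical; this is a sketch of the stated formula, not a validated tool.

```python
def hfs_percent(ti_years: float, mtbf_years: float) -> float:
    """Probability (%) that the device is in a hidden failed state."""
    if ti_years <= 0 or mtbf_years <= 0:
        raise ValueError("TI and MTBF must be positive")
    return 50.0 * ti_years / mtbf_years


def loc_percent(ti_years: float, mtbf_years: float) -> float:
    """Level of confidence (%) that the device is in a safe operating condition."""
    return 100.0 - hfs_percent(ti_years, mtbf_years)


def max_test_interval(desired_loc_percent: float, mtbf_years: float) -> float:
    """Longest TI (years) meeting the desired LOC; the formula above solved for TI."""
    if not 0 < desired_loc_percent < 100:
        raise ValueError("desired LOC must be between 0 and 100 (exclusive)")
    return (100.0 - desired_loc_percent) * mtbf_years / 50.0


# Recompute the (TI, MTBF) pairs listed in the table above.
for ti, mtbf in [(0.5, 25), (0.5, 50), (0.5, 100), (1, 25), (1, 50),
                 (1, 100), (2, 50), (2, 100), (4, 100), (4, 200)]:
    print(f"TI={ti} MTBF={mtbf} HFS={hfs_percent(ti, mtbf):g}% "
          f"LOC={loc_percent(ti, mtbf):g}%")

# Longest interval that keeps a 99% level of confidence for a 50-year MTBF.
print(max_test_interval(99, 50), "years")
```

For example, holding a 99% level of confidence with an MTBF of 50 years allows a testing interval of up to 1 year, which matches the corresponding row of the table.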
69
[Figure: Relationship between the LOC (that the device is not in a HFS), the Testing Interval (TI) in years, and the MTBF. Curves shown for MTBF = 25, 50, 100 and 200 years, with HFS levels of 1% and 2% (LOC of 99% and 98%) marked.]
70
Manufacturer-recommended maintenance intervals
Legal question: “Did you follow the manufacturer’s maintenance recommendations?”
Selection of the optimum interval requires knowledge of the non-durable part’s (NDP’s) age-related failure pattern
Extensive (pre-market) testing in a simulated environment is time consuming and costly. Therefore it is highly likely that the manufacturer’s recommendations are based more on “guesstimates” than on actual testing.
71
Questions ?