TRANSCRIPT
Lessons from history – Case studies that might help spot where things can go wrong
Mike Taylor, Advitech Pty Ltd, Mayfield, Australia
Incident Prevention Strategy, Feb 2016
• Risk-based intervention - develop a framework for the ongoing identification and verification of risk profiling, incorporating risk control measure verification, and consideration of deployment practices to target areas of risk priority.
• Human and organisational factors - research and consider the impact of human and organisational factors on risk management and reporting.
• [Cartoon: G. Hill]
A few clues on where risk control measures may be weak or missing altogether
• “We’ll risk assess that out”
• “Everybody knows” assumptions
• Specification errors
• Management systems
• Unclear responsibilities
• Human error
Other warning signs
• Too much emphasis on the risk assessment process, rather than the outcomes
• Some methods good for establishing priorities, but not much else
• Reliance placed on barriers and controls
• Controls may not be as effective as first thought
• Control weaknesses may lie dormant for years
A commonly-used method
What about barriers and controls?
• Essential to list them
• Essential to judge their effectiveness
• Be wary of re-evaluating risk until proposed barriers and controls are in place and found to be effective
• Sometimes the existing controls are the ones that are the weakest
Faults and failures
• Failure: Function not performed
• Fault: Loss of capability to perform the function when called upon to do so
• Dangerous undetected faults: May lie dormant for years before failure actually occurs (see the sketch after this list)
• Initial fault may be random or non-random
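The dangerous undetected case is the one the rest of the talk turns on, so a minimal simulation sketch may help. It is written in Python, and every rate and interval in it is an invented illustration, not a figure from the talk: a fault arises silently at a random time, and is revealed only when a proof test finds it or a real demand turns it into a failure.

import random
# Illustrative assumptions only; none of these rates come from the talk
LAMBDA_DU = 1e-5     # dangerous undetected fault rate, per hour
DEMAND_RATE = 1e-4   # rate of demands on the protective function, per hour
PROOF_TEST = 8760.0  # proof-test interval, hours (annual)
def one_history(horizon=20 * 8760.0):
    # A fault arises at a random time; it is revealed either by the next
    # scheduled proof test or by a real demand (which becomes a failure).
    fault_at = random.expovariate(LAMBDA_DU)
    if fault_at > horizon:
        return None                                   # no fault in the horizon
    demand_at = fault_at + random.expovariate(DEMAND_RATE)
    next_test = ((fault_at // PROOF_TEST) + 1) * PROOF_TEST
    if demand_at < next_test:
        return ("failure", demand_at - fault_at)      # dormant fault met a demand
    return ("found_by_test", next_test - fault_at)    # proof test caught it first
random.seed(1)
runs = [h for h in (one_history() for _ in range(100_000)) if h]
failures = [d for kind, d in runs if kind == "failure"]
print(f"faults: {len(runs)}, became failures: {len(failures)}, "
      f"mean dormancy before failure: {sum(failures) / len(failures):.0f} h")

With these assumed numbers, a substantial fraction of the faults meet a demand before the next proof test, after months of dormancy on average.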
Random hardware failures
• Corrosion, wear, seizure, loosening, etc
• Predictable as to their rate, but not as to when the next failure will occur
• Often detected and repaired before any damage caused
• Various sources of information available (histories)
• Conventional statistical analysis and modelling (a numeric sketch follows this list)
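To make the conventional-statistics bullet concrete, here is a minimal Python sketch, using invented fleet numbers rather than anything from the talk, of the classic constant-failure-rate treatment:

import math
# Invented fleet history (illustrative assumptions only)
operating_hours = 1_250_000        # cumulative fleet operating hours
failures = 14                      # recorded random hardware failures
lam = failures / operating_hours   # point estimate of failure rate, per hour
mtbf = 1 / lam                     # mean time between failures, hours
# The rate is predictable, the timing of the next failure is not: the
# exponential model only gives a probability of surviving a given period.
t = 8760                           # one year of operation
p_survive = math.exp(-lam * t)
print(f"lambda = {lam:.2e}/h, MTBF = {mtbf:.0f} h, "
      f"P(no failure in a year) = {p_survive:.2f}")

The slide's point survives the arithmetic: λ and MTBF are stable numbers, but the model can only give a probability for any particular period, never the date of the next failure.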
Engineers comforted by predictability and numbers
• Calculating probability of failure on demand, based on a uniform failure rate λ (the IEC 61508-6 expression for a 1oo2 architecture; a worked example follows this list):

PFD_G = 2[(1 − β_D)·λ_DD + (1 − β)·λ_DU]² · t_CE · t_GE + β_D · λ_DD · MTTR + β · λ_DU · (T_1/2 + MRT)
• Perhaps even seduced by the numbers?
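As a concrete illustration of what those comforting numbers look like in practice, here is a minimal Python sketch of the 1oo2 PFD formula above. The helper function and every input value are illustrative assumptions, not figures from the talk:

def pfd_1oo2(lam_du, lam_dd, beta, beta_d, t1, mttr, mrt):
    # IEC 61508-6 1oo2 formula quoted above; rates per hour, times in hours
    lam_d = lam_du + lam_dd
    # Channel and group equivalent down-times
    t_ce = (lam_du / lam_d) * (t1 / 2 + mrt) + (lam_dd / lam_d) * mttr
    t_ge = (lam_du / lam_d) * (t1 / 3 + mrt) + (lam_dd / lam_d) * mttr
    return (2 * ((1 - beta_d) * lam_dd + (1 - beta) * lam_du) ** 2 * t_ce * t_ge
            + beta_d * lam_dd * mttr
            + beta * lam_du * (t1 / 2 + mrt))
# Illustrative assumptions: 1e-6/h dangerous rate split evenly between
# detected and undetected, 10% / 5% common-cause factors, annual proof
# test, 8 h restoration and repair times.
print(pfd_1oo2(lam_du=5e-7, lam_dd=5e-7, beta=0.10, beta_d=0.05,
               t1=8760, mttr=8.0, mrt=8.0))

With these example inputs the common-cause term β·λ_DU·(T_1/2 + MRT) dominates the result: the redundant pair is only as good as the weaknesses its channels share, which foreshadows the talk's theme.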
Non-random failures
• So-called “systematic failures”
• Not related to normal degradation mechanisms of corrosion, wear, etc
• Deterministic rather than probabilistic
• Often more difficult to detect and eliminate
• Actual failure may be the first indication of trouble
What can be learned from history of non-random faults and failures?
• Quantitative information (component life, failure modes, etc) generally not applicable
• Fewer obvious examples, unlike failures of hardware components
• Not amenable to statistical analysis or modeling
• Subtle, underlying causes, often overlooked in post-incident investigations
Why might systematic (non-random) failures receive less attention?
• People may assume that existing management systems and processes are able to deal with them
• Examples:
– Design reviews
– Approvals processes
– Issues tracking
– Management of change
– Check / back-check systems
Case studies
• Barriers and controls found to be less effective than initially assumed
• Non-random failures; events not equally likely
• Underlying faults or weaknesses that can remain undetected for long periods
Clapham Junction, London, 1988
• Three trains collided
• 35 people killed
• Signal was green when it should have been red
• A wiring fault, after modification work
• Immediate fault was dormant for about eight hours
• Underlying fault dormant for years
Source: Hidden A, 1989, Investigation into the Clapham Junction Railway Accident, Department of Transport, London
Milton Keynes, Buckinghamshire, 2008
• Signal was green when it should have been red
• Fault was noticed before a collision could occur
• A software specification error, as part of modification work
• Fault was dormant for months
Source: RAIB, 2010, Special Investigation – Review of the railway industry’s investigation of an irregular signal sequence at Milton Keynes, 29 December 2008, Department for Transport
Falkirk, Scotland, 2009
• Points were set in the wrong position for the train to pass safely
• Train, at about 100 km/hour, fortunately did not derail
• A wiring fault, after modification work
• Proper testing not carried out after the work
• Fault was dormant for a few hours
• Underlying fault dormant for years
Source: RAIB, 2010, Rail Accident Report: Incident at Greenhill Upper Junction, near Falkirk, 22 March 2009, Report 04/2010, Department for Transport
Falkirk, Scotland
• Wire count not performed in the field
• Field workers assumed the wire count had been done in the workshop
Cootamundra, NSW, 2009
• Signal was green when it should have been red
• Fault was noticed before a collision could occur
• An error during the design was not properly tracked
• Fault was dormant for two years
Source: ATSB Transport Safety Report, Rail Occurrence Investigation RO-2009-009, Reported signal irregularity at Cootamundra NSW involving trains ST22 and 4MB7, 12 November 2009
Minneapolis, MN, 2007
• Steel bridge collapsed
• 13 people killed
• Design fault, carried through to construction
• Fault was dormant for 40 years
Source: National Transportation Safety Board, Accident Report NTSB/HAR-08/03, PB2008-916203, Collapse of I-35W Highway Bridge, Minneapolis, Minnesota, August 1, 2007
USAir, Aliquippa, PA, 1994
• Aircraft crashed during landing approach, with all on board lost
• Control system failure
• Original failure modes analysis anticipated such a failure
• Analysis did not properly anticipate the effects
• Fault was dormant for 25 years
• Fault not revealed until two other aircraft incidents
Source: National Transportation Safety Board, Aircraft Accident Report: Uncontrolled Descent and Collision with Terrain, USAir Flight 427, Boeing 737-300, N513AU, near Aliquippa, Pennsylvania, September 8, 1994, PB99-910401
Alaska Airlines, Anacapa Island, CA, 2000
• Aircraft lost control in flight and crashed into the sea, with all on board lost
• Mechanical failure of screw thread and nut
• Evidence of wear could have been detected, but was not
• Fault was dormant for ten years
Source: National Transportation Safety Board, Aircraft Accident Report: Loss of Control and Impact with Pacific Ocean, Alaska Airlines Flight 261, McDonnell Douglas MD-83, N963AS, about 2.7 miles north of Anacapa Island, California, January 31, 2000, NTSB/AAR-02/01, PB2002-910402
American Airlines, Belle Harbor, NY, 2001
• Aircraft crashed shortly after take-off, with all on board lost
• Pilot error
• Haptic feedback (“feel”) of rudder pedals different from many other similar aircraft
• Aggressive use of rudder; vertical stabilizer overloaded
Source: National Transportation Safety Board, Aircraft Accident Report NTSB/AAR-04/04: In-Flight Separation of Vertical Stabilizer, American Airlines Flight 587, Airbus Industrie A300-605R, N14053, Belle Harbor, New York, November 12, 2001, PB2004-910404, Notation 7439B
Cape Hillsborough, Qld, Australia, 2003
• Emergency medical services helicopter mission
• Aircraft crashed into sea on foggy night, with all on board lost
• Possible loss of spatial orientation
• Several key risk factors present
• Operators unaware of US study into risk factors
• Fault was dormant for ten years
Source: Australian Transport Safety Bureau, Aviation Safety Investigation 200304282, Bell 407 VH-HT, Cape Hillsborough, Qld, 17 October 2003
Markham Colliery, UK, 1973
• Brake rod broke (fatigue fracture)
• 18 people killed
• Poor design: no practicable means of lubrication
• Warning from a 1961 incident
• Crack probably present when inspected in 1961
Source: Calder JW, 1974, Accident at Markham Colliery, Derbyshire: report on the cause of, and circumstances attending, the overwind which occurred at Markham Colliery, Derbyshire, on 30 July 1973, Department of Energy
Qantas, Batam Island, Indonesia, 2010
• A380 engine rotor failure
• Significant damage from debris
• Caused by a broken oil feed pipe, poorly manufactured
• Failure modes analysis did not properly anticipate the effects
• Two faults, each dormant for several years
Source: ATSB Transport Safety Report, Aviation Occurrence Investigation AO-2010-089, 27 June 2013: In-flight uncontained engine failure, Airbus A380, VH-OQA, overhead Batam Island, Indonesia, 4 November 2010
Conclusions
• Plenty of new mistakes to be made, without repeating the old ones
• Human error implicated in most of these cases
• Human error rates much higher than those for physical devices
• Statistics not much help when dealing with non-random failures
Conclusions
• Easy to lose sight of the real issues if just focused on process
• Misplaced reliance on barriers and controls, especially existing controls
• Weaknesses can remain dormant for years
Implications for designers and operators
• Recognise that one systematic fault can undo all the good work with random hardware failure predictions
• Recognise the places where things can go wrong:
– Specification errors
– Failure mode assumptions
– “Everybody knows” assumptions
– Unclear responsibilities
• Look for subtle signs of problems during operations
Thank you for your attention