TRANSCRIPT
Lessons from history – Case studies that might help spot where things can go wrong
Mike Taylor, Advitech Pty Ltd, Mayfield, Australia
Incident Prevention Strategy, Feb 2016
• Risk-based intervention - develop a framework for the ongoing identification and verification of risk profiling, incorporating risk control measure verification, and consideration of deployment practices to target areas of risk priority.
• Human and organisational factors - research and consider the impact of human and organisational factors on risk management and reporting.
• [Cartoon: G. Hill]
A few clues on where risk control measures may be weak or missing altogether
• “We’ll risk assess that out”
• “Everybody knows” assumptions
• Specification errors
• Management systems
• Unclear responsibilities
• Human error
Other warning signs
• Too much emphasis on the risk assessment process, rather than the outcomes
• Some methods good for establishing priorities, but not much else
• Reliance placed on barriers and controls
• Controls may not be as effective as first thought
• Control weaknesses may lie dormant for years
A commonly-used method
What about barriers and controls?
• Essential to list them
• Essential to judge their effectiveness
• Be wary of re-evaluating risk until proposed barriers and controls are in place and found to be effective
• Sometimes the existing controls are the ones that are the weakest
Faults and failures
• Failure: Function not performed
• Fault: Loss of capability to perform the function when called upon to do so
• Dangerous undetected faults: May lie dormant for years before failure actually occurs (see the sketch after this list)
• Initial fault may be random or non-random
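The dangerous undetected case is the one the rest of the talk turns on, so a minimal simulation sketch may help. It is written in Python, and every rate and interval in it is an invented illustration, not a figure from the talk: a fault arises silently at a random time, and is revealed only when a proof test finds it or a real demand turns it into a failure.

import random
# Illustrative assumptions only; none of these rates come from the talk
LAMBDA_DU = 1e-5     # dangerous undetected fault rate, per hour
DEMAND_RATE = 1e-4   # rate of demands on the protective function, per hour
PROOF_TEST = 8760.0  # proof-test interval, hours (annual)
def one_history(horizon=20 * 8760.0):
    # A fault arises at a random time; it is revealed either by the next
    # scheduled proof test or by a real demand (which becomes a failure).
    fault_at = random.expovariate(LAMBDA_DU)
    if fault_at > horizon:
        return None                                   # no fault in the horizon
    demand_at = fault_at + random.expovariate(DEMAND_RATE)
    next_test = ((fault_at // PROOF_TEST) + 1) * PROOF_TEST
    if demand_at < next_test:
        return ("failure", demand_at - fault_at)      # dormant fault met a demand
    return ("found_by_test", next_test - fault_at)    # proof test caught it first
random.seed(1)
runs = [h for h in (one_history() for _ in range(100_000)) if h]
failures = [d for kind, d in runs if kind == "failure"]
print(f"faults: {len(runs)}, became failures: {len(failures)}, "
      f"mean dormancy before failure: {sum(failures) / len(failures):.0f} h")

With these assumed numbers, a substantial fraction of the faults meet a demand before the next proof test, after months of dormancy on average.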
Random hardware failures
• Corrosion, wear, seizure, loosening, etc
• Predictable as to their rate, but not as to when the next failure will occur
• Often detected and repaired before any damage caused
• Various sources of information available (histories)
• Conventional statistical analysis and modelling (a numeric sketch follows this list)
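To make the conventional-statistics bullet concrete, here is a minimal Python sketch, using invented fleet numbers rather than anything from the talk, of the classic constant-failure-rate treatment:

import math
# Invented fleet history (illustrative assumptions only)
operating_hours = 1_250_000        # cumulative fleet operating hours
failures = 14                      # recorded random hardware failures
lam = failures / operating_hours   # point estimate of failure rate, per hour
mtbf = 1 / lam                     # mean time between failures, hours
# The rate is predictable, the timing of the next failure is not: the
# exponential model only gives a probability of surviving a given period.
t = 8760                           # one year of operation
p_survive = math.exp(-lam * t)
print(f"lambda = {lam:.2e}/h, MTBF = {mtbf:.0f} h, "
      f"P(no failure in a year) = {p_survive:.2f}")

The slide's point survives the arithmetic: λ and MTBF are stable numbers, but the model can only give a probability for any particular period, never the date of the next failure.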
Engineers comforted by predictability and numbers
• Calculating probability of failure on demand, based on a uniform failure rate λ (the IEC 61508-6 expression for a 1oo2 architecture; a worked example follows this list):

PFD_G = 2[(1 − β_D)·λ_DD + (1 − β)·λ_DU]² · t_CE · t_GE + β_D · λ_DD · MTTR + β · λ_DU · (T_1/2 + MRT)
• Perhaps even seduced by the numbers?
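As a concrete illustration of what those comforting numbers look like in practice, here is a minimal Python sketch of the 1oo2 PFD formula above. The helper function and every input value are illustrative assumptions, not figures from the talk:

def pfd_1oo2(lam_du, lam_dd, beta, beta_d, t1, mttr, mrt):
    # IEC 61508-6 1oo2 formula quoted above; rates per hour, times in hours
    lam_d = lam_du + lam_dd
    # Channel and group equivalent down-times
    t_ce = (lam_du / lam_d) * (t1 / 2 + mrt) + (lam_dd / lam_d) * mttr
    t_ge = (lam_du / lam_d) * (t1 / 3 + mrt) + (lam_dd / lam_d) * mttr
    return (2 * ((1 - beta_d) * lam_dd + (1 - beta) * lam_du) ** 2 * t_ce * t_ge
            + beta_d * lam_dd * mttr
            + beta * lam_du * (t1 / 2 + mrt))
# Illustrative assumptions: 1e-6/h dangerous rate split evenly between
# detected and undetected, 10% / 5% common-cause factors, annual proof
# test, 8 h restoration and repair times.
print(pfd_1oo2(lam_du=5e-7, lam_dd=5e-7, beta=0.10, beta_d=0.05,
               t1=8760, mttr=8.0, mrt=8.0))

With these example inputs the common-cause term β·λ_DU·(T_1/2 + MRT) dominates the result: the redundant pair is only as good as the weaknesses its channels share, which foreshadows the talk's theme.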
Non-random failures
• So-called “systematic failures”
• Not related to normal degradation mechanisms of corrosion, wear, etc
• Deterministic rather than probabilistic
• Often more difficult to detect and eliminate
• Actual failure may be the first indication of trouble
What can be learned from history of non-random faults and failures?
• Quantitative information (component life, failure modes, etc) generally not applicable
• Fewer obvious examples, unlike failures of hardware components
• Not amenable to statistical analysis or modeling
• Subtle, underlying causes, often overlooked in post-incident investigations
Why might systematic (non-random) failures receive less attention?
• People may assume that existing management systems and processes are able to deal with them
• Examples:
– Design reviews
– Approvals processes
– Issues tracking
– Management of change
– Check / back-check systems
Case studies
• Barriers and controls found to be less effective than initially assumed
• Non-random failures; events not equally likely
• Underlying faults or weaknesses that can remain undetected for long periods
Clapham Junction, London, 1988
• Three trains collided
• 35 people killed
• Signal was green when it should have been red
• A wiring fault, after modification work
• Immediate fault was dormant for about eight hours
• Underlying fault dormant for years
Source: Hidden A, 1989, Investigation into the Clapham Junction Railway Accident, Department of Transport, London
Milton Keynes, Buckinghamshire, 2008
• Signal was green when it should have been red
• Fault was noticed before a collision could occur
• A software specification error, as part of modification work
• Fault was dormant for months
Source: RAIB, 2010, Special Investigation – Review of the railway industry’s investigation of an irregular signal sequence at Milton Keynes, 29 December 2008, Department for Transport
Falkirk, Scotland, 2009
• Points were set in the wrong position for the train to pass safely
• Train, at about 100 km/hour, fortunately did not derail
• A wiring fault, after modification work
• Proper testing not carried out after the work
• Fault was dormant for a few hours
• Underlying fault dormant for years
Source: RAIB, 2010, Rail Accident Report: Incident at Greenhill Upper Junction, near Falkirk, 22 March 2009, Report 04/2010, Department for Transport
Falkirk, Scotland
• Wire count not performed in the field
• Field workers assumed the wire count had been done in the workshop
Cootamundra, NSW, 2009
• Signal was green when it should have been red
• Fault was noticed before a collision could occur
• An error during the design was not properly tracked
• Fault was dormant for two years
Source: ATSB Transport Safety Report, Rail Occurrence Investigation RO-2009-009, Reported signal irregularity at Cootamundra NSW involving trains ST22 and 4MB7, 12 November 2009
Minneapolis, MN, 2007
• Steel bridge collapsed
• 13 people killed
• Design fault, carried through to construction
• Fault was dormant for 40 years
Source: National Transportation Safety Board, Accident Report NTSB/HAR-08/03, PB2008-916203, Collapse of I-35W Highway Bridge, Minneapolis, Minnesota, August 1, 2007
USAir, Aliquippa, PA, 1994
• Aircraft crashed during landing approach, with all on board lost
• Control system failure
• Original failure modes analysis anticipated such a failure
• Analysis did not properly anticipate the effects
• Fault was dormant for 25 years
• Fault not revealed until two other aircraft incidents
Source: National Transportation Safety Board, Aircraft Accident Report: Uncontrolled Descent and Collision with Terrain, USAir Flight 427, Boeing 737-300, N513AU, near Aliquippa, Pennsylvania, September 8, 1994, PB99-910401
Alaska Airlines, Anacapa Island, CA, 2000
• Aircraft lost control in flight and crashed into the sea, with all on board lost
• Mechanical failure of screw thread and nut
• Evidence of wear could have been detected, but was not
• Fault was dormant for ten years
Source: National Transportation Safety Board, Aircraft Accident Report: Loss of Control and Impact with Pacific Ocean, Alaska Airlines Flight 261, McDonnell Douglas MD-83, N963AS, about 2.7 miles north of Anacapa Island, California, January 31, 2000, NTSB/AAR-02/01, PB2002-910402
American Airlines, Belle Harbor, NY, 2001
• Aircraft crashed shortly after take-off, with all on board lost
• Pilot error
• Haptic feedback (“feel”) of rudder pedals different from many other similar aircraft
• Aggressive use of rudder; vertical stabilizer overloaded
Source: National Transportation Safety Board, Aircraft Accident Report NTSB/AAR-04/04: In-Flight Separation of Vertical Stabilizer, American Airlines Flight 587, Airbus Industrie A300-605R, N14053, Belle Harbor, New York, November 12, 2001, PB2004-910404, Notation 7439B
Cape Hillsborough, Qld, Australia, 2003
• Emergency medical services helicopter mission
• Aircraft crashed into sea on foggy night, with all on board lost
• Possible loss of spatial orientation
• Several key risk factors present
• Operators unaware of US study into risk factors
• Fault was dormant for ten years
Source: Australian Transport Safety Bureau, Aviation Safety Investigation 200304282, Bell 407 VH-HT, Cape Hillsborough, Qld, 17 October 2003
Markham Colliery, UK, 1973
• Brake rod broke (fatigue fracture)
• 18 people killed
• Poor design: no practicable means of lubrication
• Warning from a 1961 incident
• Crack probably present when inspected in 1961
Source: Calder JW, 1974, Accident at Markham Colliery, Derbyshire: report on the cause of, and circumstances attending, the overwind which occurred at Markham Colliery, Derbyshire, on 30 July 1973, Department of Energy
Qantas, Batam Island, Indonesia, 2010
• A380 engine rotor failure
• Significant damage from debris
• Caused by a broken oil feed pipe, poorly manufactured
• Failure modes analysis did not properly anticipate the effects
• Two faults, each dormant for several years
Source: ATSB Transport Safety Report, Aviation Occurrence Investigation AO-2010-089, 27 June 2013: In-flight uncontained engine failure, Airbus A380, VH-OQA, overhead Batam Island, Indonesia, 4 November 2010
Conclusions
• Plenty of new mistakes to be made, without repeating the old ones
• Human error implicated in most of these cases
• Human error rates much higher than those for physical devices
• Statistics not much help when dealing with non-random failures
Conclusions
• Easy to lose sight of the real issues if just focused on process
• Misplaced reliance on barriers and controls, especially existing controls
• Weaknesses can remain dormant for years
Implications for designers and operators
• Recognise that one systematic fault can undo all the good work with random hardware failure predictions
• Recognise the places where things can go wrong:
– Specification errors
– Failure mode assumptions
– “Everybody knows” assumptions
– Unclear responsibilities
• Look for subtle signs of problems during operations
Thank you for your attention