Do you understand the impact of failure in your critical engineering infrastructure? A risk based approach.

Upload: integral-uk-ltd

Post on 24-Mar-2016


http://www.uptimeplus.co.uk/Do%20you%20understand%20the%20impact%20of%20failure%20in%20your%20critical%20engineering%20infrastructure.pdf


List of Contents

Summary
1.1 Introduction
1.2 Risk Management
2 Critical Engineering Risk Studies
2.1 Compliance Risk Studies
3 Business Continuity
3.1 Real Time Risk Monitoring Tools
3.2 Risk Impact Cost analysis
Conclusion
References


Summary

This paper addresses the failure of companies to fully understand and mitigate the risk from the critical engineering infrastructure, processes and resource controls that support their businesses. Failure to fully understand the risks, and failure to take the correct actions to remove or reduce them, can result in a high cost to the business.

Uptimeplus proposes a Risk Model that combines traditional risk management techniques with real time risk status software management of the critical engineering infrastructure, incorporating the three key elements of People, Process and Critical Infrastructure, as follows:

A site specific critical infrastructure visual risk dependency model, which utilises live feeds and workflow streams to provide a real time status of both operational and capacity risks of the critical infrastructure, whether electrical or mechanical. It can be accessed from any PC, and through a bespoke dashboard from any mobile device, providing data centre managers with a real time view of operational risk.

A site specific compliance visual risk dependency model, which utilises workflow streams with automated date monitoring and escalation processes, and which can be accessed from any PC and through a bespoke dashboard from any mobile device.

A site specific uptimeplus processes visual risk dependency model, which tracks the implementation of uptimeplus CEM processes and provides automated date monitoring and escalation processes, and which can be accessed from any PC and through a bespoke dashboard from any mobile device.


1.1 Introduction

Failure to adequately identify and manage risks can result in devastating reputational and financial impact, as has been aptly demonstrated within the finance industry in recent years. While risk management processes are widely available and utilised within business organisations of all sizes, it is unusual for those processes to be adequately documented and implemented for critical systems engineering, where two main issues arise. Firstly, it can be unclear to the managing teams what impact the failure of an asset or a process will have on its dependants and ultimately on business operations. Secondly, failures occur and are either not reported or are reported without sufficient clarification of the risk to dependants and ultimately to business operations.

This situation arises from the differing range of experts employed in the construction and management of a critical systems environment, and from the often incorrect assumption that all the risks have been mitigated during the design and construction phase. While this is not so prevalent in the datacentre environment, it is more common for smaller businesses running their own critical infrastructures. Often smaller critical environments are designed and constructed by suitably qualified teams and then handed over to a building/facilities manager for day to day management. It is unusual to find a building/facilities manager who has been trained across the full range of disciplines involved, i.e. mechanical and electrical engineering, IT engineering, risk management and facilities management. It is not the aim of this document to detract from the role of the building/facilities manager, but it is clear that when they are not technically trained they are heavily reliant on the process documentation supplied during construction, and on the incumbent maintenance suppliers, to identify and manage risks. Both the physical engineering systems and the human systems supporting them must be evaluated to ensure the total system meets the business need, with clear accountability and a full audit trail of issues raised and resolutions made.

The current financial crisis within the UK has resulted in businesses and organisations of all sizes looking to reduce operating costs, including the maintenance and operation of mechanical and electrical (M&E) assets. This has produced a very competitive M&E maintenance environment in which maintenance companies look to reduce costs through multiskilling and by reducing the number of time based maintenance visits, but often fail to implement a predictive maintenance scheme to detect potential asset failures. While this is sufficient for general building maintenance, it is unlikely that a site engineer will fully understand when the failure of an asset has actually put the business at risk, and they may not even report the failure, especially if a standby unit has started and kept systems operational.

Taking all the above into consideration, it is clear that any person responsible for a critical environment must have robust operational and reporting processes in place, together with a clear line of sight of the impact of an asset’s failure on its dependants.


1.2 Risk Management

The Oxford Dictionary defines risk management as “The forecasting and evaluation of financial risks together with the identification of procedures to avoid or minimize their impact” [1]. A robust risk management process will identify each risk and evaluate its impact on business operations and assets in conjunction with its probability. It will also identify mitigating controls to reduce or remove the risk, and provide some form of monitoring to ensure that the necessary actions and resolutions are implemented and recorded. Before a risk can be identified, its key drivers need to be understood, as shown in Fig 1.

This paper concentrates on operational risks related to the M&E critical systems and the internal information systems required to identify, assess and report any risk to business operations. The two recognised methods for identifying risks are the quantitative and the qualitative approach. The qualitative risk assessment is generally considered to be a very straightforward process, based on judgement and requiring no specialist skills or complicated techniques.
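A qualitative assessment of this kind is commonly recorded as a likelihood and impact matrix. The sketch below is one minimal way of doing so. The 1 to 5 scales, the band thresholds and the function name `risk_rating` are illustrative assumptions and are not taken from this paper.

```python
def risk_rating(likelihood, impact):
    """Return (score, band) for a judgement-based 5 x 5 risk matrix.

    likelihood and impact are scores from 1 (lowest) to 5 (highest).
    The band thresholds below are hypothetical, for illustration only.
    """
    if not (1 <= likelihood <= 5 and 1 <= impact <= 5):
        raise ValueError("likelihood and impact must be between 1 and 5")
    score = likelihood * impact
    if score >= 15:
        band = "high"
    elif score >= 6:
        band = "medium"
    else:
        band = "low"
    return score, band


# Example: an unlikely event (2) with a severe business impact (5).
score, band = risk_rating(2, 5)
print(score, band)
```

In practice the scales and thresholds would be agreed with the business so that a "high" rating reflects its own tolerance for loss.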

Fig 1. Risk Drivers [2]


Risk assessment of critical systems engineering will be quantitative, where a numerical estimate is made of the probability that a defined harm will result from the occurrence of a particular event. Various methods are used to determine the numerical value, including the following:

Comparative Methods
    Checklists
    Audits

Fundamental Methods
    Deviation Analysis
    Hazard and Operability Studies
    Energy Analysis
    Failure Modes & Effects Analysis

Failure Logic
    Fault Trees
    Event Trees
    Cause-Consequence diagrams
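As an illustration of the failure logic methods listed above, the following sketch evaluates a small fault tree. It assumes that the basic events are statistically independent, which real plant sharing common causes may not satisfy. The asset names and annual failure probabilities are hypothetical values chosen only to show the arithmetic.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if ALL input events occur (independence assumed)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if ANY input event occurs (independence assumed)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical annual failure probabilities, for illustration only.
p_mains = 0.10
p_generator = 0.05
p_ups = 0.02
p_chiller_a = 0.08
p_chiller_b = 0.08

# Power is lost only if mains, standby generator and UPS all fail.
loss_of_power = and_gate([p_mains, p_generator, p_ups])
# Cooling is lost only if both chillers fail.
loss_of_cooling = and_gate([p_chiller_a, p_chiller_b])
# The top event (loss of service) occurs if either power or cooling is lost.
loss_of_service = or_gate([loss_of_power, loss_of_cooling])

print(f"Probability of loss of service: {loss_of_service:.6f}")
```

In this example the cooling branch dominates the result, which is the kind of insight a fault tree is intended to expose: the redundancy in the power chain is far deeper than that in the cooling chain.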

Once risks have been identified and evaluated, an action plan should be created and reviewed before implementation, typically by asking:

Will the revised controls lead to tolerable risk levels?
Are new hazards created?
Has the most cost-effective solution been chosen?
What do people affected think about the need for, and practicality of, the revised preventive measures?
Will the revised controls be used in practice, and not ignored in the face of, for example, pressures to get the job done?

There is a variety of software packages on the market for qualitative and quantitative risk assessment. These provide a way of quantifying and managing risk, ensuring that identified mitigation procedures and processes are implemented and that their implementation, or otherwise, is recorded. However, these systems can give a building manager a false sense of security, especially where critical engineering systems are concerned. The typical modus operandi is that the risk assessment of the critical systems is made, processes and procedures are implemented, and there is then a long interval before the risks and procedures are reviewed, if at all.

Risk assessment should be seen as a continuing process. Thus, the adequacy of control measures should be subject to continual review and revised if necessary.


2 Critical Engineering Risk Studies

The first step to disaster tolerance is risk avoidance, and the way to avoid risk in critical engineering is to identify and remove or mitigate single points of failure. In a critical engineering risk study the three key elements are Technology, People and Process, and a Single Point of Failure study should be carried out on all three. To ensure these studies provide an accurate assessment, the following key points should be understood and clarified with the client before the survey starts.

1. List of critical areas and supporting services needed to maintain business operations.
2. Original design intent of the critical engineering system.
3. Number of staff needed to maintain business operations.
4. External IT equipment and links required to maintain operations.
5. Client’s business continuity plans and timescales before they are implemented.
6. Cost of implementing the business continuity plan.
7. Value of loss caused by loss of business operations.

Unless all the above are clearly understood, there is a real danger that the Single Point of Failure study will identify risks, and recommend measures to mitigate them, whose cost is unjustified when compared to the value of the loss of business operations or the cost of implementing the business continuity plan. The single points of failure survey will review the following areas to identify the impact of failure on dependent assets or business operations.

Internal
    Standby Power systems
    Power
    Cabling
    Cooling
    Segregation of Critical systems
    Fire Suppression & Detection
    Flood prevention
    Training
    Personnel
    Emergency operating procedures

External
    Supply Power
    Flood Risks
    Security
    Transport links

Carrying out single points of failure surveys is common practice within the industry, and there is no argument that, once completed, a survey will give a building manager an understanding of the risks and of what is required to remove or mitigate them.

What it does not provide is a real time view of the actual risk to the systems when an asset fails, so that the building manager can correctly evaluate the possible impact and decide what actions need to be carried out. Depending on the quality of processes and personnel, the building manager is often left unaware that an asset has failed that could in time affect business operations.
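The technology element of a single point of failure study can be sketched as a search over a dependency model. The example below is a minimal illustration and not the uptimeplus tool. It assumes an acyclic dependency graph in which each asset needs at least one working member from each of its redundancy groups, and all asset names are hypothetical.

```python
def is_up(asset, requires, failed):
    """An asset is up if it has not failed and every redundancy group it
    depends on still has at least one working member."""
    if asset in failed:
        return False
    for group in requires.get(asset, []):
        if not any(is_up(member, requires, failed) for member in group):
            return False
    return True


def single_points_of_failure(service, requires):
    """Return the assets whose failure alone takes the service down."""
    assets = set(requires)
    for groups in requires.values():
        for group in groups:
            assets.update(group)
    assets.discard(service)
    return sorted(a for a in assets if not is_up(service, requires, {a}))


# Hypothetical site: each inner list is a set of redundant alternatives.
requires = {
    "server_room": [["ups_a", "ups_b"], ["crac_1", "crac_2"]],
    "ups_a": [["switchboard"]],
    "ups_b": [["switchboard"]],
    "crac_1": [["chiller"]],
    "crac_2": [["chiller"]],
    "switchboard": [["mains", "generator"]],
}

print(single_points_of_failure("server_room", requires))
```

Here the duplicated UPS and cooling units appear resilient, yet both pairs hang from a single switchboard and a single chiller, which the search reports as the single points of failure.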


2.1 Compliance Risk Studies

For businesses and organisations, failure to comply with statutory regulations can cause both financial and reputational loss as well as the risk of prosecution. The UK is heavily regulated, and ensuring compliance of the regulated tasks through planned maintenance, continual monitoring and completion of identified actions is a heavy burden on resources.

Periodic Electrical Review
PAT Testing
Fire Certification
Fire Alarm Testing
Emergency Light Testing
Fire Extinguishers
Fire Risk Assessment
Boiler Certification (oil, gas & LPG)
Landlords Gas Safety Certificate
Energy Performance Certification
Asbestos Surveys
Air Conditioning Servicing
Lightning Protection Equipment

The Health and Safety Laboratory, an agency of the Health and Safety Executive (HSE), was tasked by HSE’s Legionella Committee in September 2011 to gather data on outbreaks of Legionnaires’ disease in Great Britain. This was completed for the 10 year period to August 2011, to identify the relationship with a range of factors.

It can be seen from Fig 2 that 63% of the enforcements were due to legionella outbreaks on hot and cold water systems. Building managers who fail to understand and control the risk of regulatory compliance tasks will find themselves not only part of the statistics but also, depending on the size of the impact, in the headlines!

As previously identified, there are a number of software packages for managing risk and compliance. However, very few businesses and organisations invest the time and money to ensure they are operated effectively, and time based audits will always find issues of either missed inspection dates or corrective actions not completed. It has become accepted that time based audits will always find issues, and that they are a way of checking on incumbent maintenance providers and pushing them to get tasks completed.

Fig 2. Legionella Enforcements [3]


3 Business Continuity

Business Continuity Management (BCM) is the process of planning to ensure that your business can return to "business as usual" as quickly and painlessly as possible in the event of a major disruption. “Around half of all businesses experiencing a disaster with no effective plans for recovery fail within the following 12 months” [4]. Businesses and organisations have a range of software packages and consulting companies to assist them in devising and implementing a business continuity plan, but all use the following basic planning and implementation steps for ensuring business continuity.

Step 1: Analyse your business
Step 2: Assess the risks
Step 3: Plan and prepare
Step 4: Communicate your plan
Step 5: Test your plan

To ensure that a business continuity plan is effective it must be tested, and unless full testing is completed, documented and assessed, a business will never fully understand whether its contingency planning is sufficient to mitigate disaster. Having an effective, tested business continuity management plan will provide insight into the level of resilience required of the critical engineering infrastructure. If we take the following extreme cases, clearly Business 2 will require a far more resilient infrastructure:

Business 1 - Provides finance solutions to businesses on a software platform that has 5 mirrored servers in five countries; business operations will only be impacted if all 5 servers are down at the same time.

Business 2 - Provides finance solutions to businesses on a software platform that has 1 server in 1 country.

However, if Business 1 has decided, due to its operating model, that it does not need resilient infrastructures, but has not fully tested that it can operate utilising only one server, then clearly it is leaving itself at risk. If a business relies on mirror sites as part of its contingency plans then it must ensure its testing is effective and complete, and for each site it would have to carry out the following:

Shut down IT servers

Remove all power to the property

Disconnect all data-links to the property

Very few firms can demonstrate that they have gone to these lengths to simulate an entire building loss, often choosing software data transfer and testing as an alternative. Businesses must have a global view of their operations, understand the risks across their entire portfolio, and have a means to identify the impact on resilience as failures occur.


3.1 Real Time Risk Monitoring Tools

Being able to identify key risk issues and illustrate these clearly and concisely to colleagues and business leaders, who are often non-technical, is a key requirement in the decision making and management process [5]. Key factors to the success of critical engineering environments are:

Visibility
Transparency
Accountability
Auditability
Ability to communicate quickly and accurately

It is uptimeplus' proposal that, for businesses to fully understand their risk across a range of systems, real time monitoring and modelling systems are the way forward.

Critical Systems

Linking live status information from critical engineering systems to a visual risk dependency model will provide the building manager with accurate real time information regarding the operating status of the plant. In addition, the visual risk dependency model would provide a clear indication of the risk arising from failed assets and ultimately the risk to business operations. Having this information available to key staff will ensure that consensus is quickly obtained regarding the correct course of action, if any, to mitigate or remove the risk, whether that means changes to the M&E systems themselves or moving critical workflows to other sites.
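The idea of combining live status with a dependency model can be sketched as follows. This is an illustrative assumption about how such a model might classify risk, not a description of the uptimeplus software. The status labels, group names and asset names are hypothetical, and the live feed is represented simply as a set of failed assets.

```python
def risk_status(groups, failed):
    """Classify a service from live asset status.

    groups maps a redundancy group name to its assets; the service needs
    at least one working asset in every group. Returns (status, notes).
    """
    status = "NORMAL"
    notes = []
    for name, assets in groups.items():
        working = [a for a in assets if a not in failed]
        if len(working) == len(assets):
            continue  # full redundancy intact in this group
        notes.append(f"{name}: {len(working)} of {len(assets)} working")
        if not working:
            status = "DOWN"
        elif status != "DOWN":
            status = "AT RISK"  # still running, but redundancy is reduced
    return status, notes


# Hypothetical site with duplicated power and cooling.
groups = {"power": ["ups_a", "ups_b"], "cooling": ["crac_1", "crac_2"]}

# A standby unit has taken over: the service is still up, but exposed.
print(risk_status(groups, failed={"ups_a"}))
```

The "AT RISK" state captures the case described earlier in this paper, where a standby unit keeps systems operational and the failure might otherwise go unreported.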

Compliance

Providing businesses with real time visual data for regulatory compliance, coupled with workflow systems that automatically issue reminders of inspection and testing dates, will reduce the need for frequent time based audits and so reduce the resource required.
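Automated date monitoring with escalation can be sketched in a few lines. The warning window, the escalation period and the role names below are hypothetical assumptions chosen for illustration; an actual workflow system would take these from the site's own procedures.

```python
from datetime import date, timedelta


def compliance_status(due, today, warn_days=30):
    """Return 'OK', 'DUE SOON' or 'OVERDUE' for an inspection due date."""
    if today > due:
        return "OVERDUE"
    if due - today <= timedelta(days=warn_days):
        return "DUE SOON"
    return "OK"


def escalation_target(due, today, escalate_after_days=14):
    """Decide who should be notified (role names are hypothetical)."""
    if today <= due:
        return "maintenance provider"
    if (today - due).days <= escalate_after_days:
        return "building manager"
    return "senior management"


# Example: a fire alarm test that was due on 1 June 2016.
due = date(2016, 6, 1)
today = date(2016, 6, 10)
print(compliance_status(due, today), "->", escalation_target(due, today))
```

Running a check like this daily against every regulated task gives the continuous view that a periodic audit can only provide after the event.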


Operating Procedures

Preventing an incident from escalating from a risk to a disaster requires standard and emergency operating procedures to be in place and utilised. Standard Operating Procedures are required to reduce the risk of an incident occurring by providing forward planning of staffing resource, staff training and technical operation of the critical systems. Emergency Operating Procedures are required to ensure that when an incident does occur the correct action is taken by the onsite teams to prevent a disaster. Providing businesses with real time visual data on the status of all operating procedures will give them assurance that the required procedures are in place, and also give them access to those procedures so they can familiarise themselves with emergency requirements.

Global View of Business Operations

Providing a global view of the real-time risk levels across an entire business portfolio will ensure that appropriate decisions are made with respect to any implementation that may compound an identified risk. If a single critical site is at risk and this is immediately highlighted, understanding what has caused the risk a) helps ensure that it is not repeated at other sites and b) gives you the opportunity to stop scheduled work that may impact your contingency.

Providing real time risk monitoring tools with a clear visual indication of status will give businesses confidence that their risk has been removed or reduced to an acceptable level, and will also provide the visibility, transparency, accountability and auditability required for critical environments. As the information is globally available, it will increase the ability to communicate quickly and accurately so that the correct decisions can be made when an incident occurs.

Ultimately this would reduce the resource required by both the business and its support staff, as the system would be self-policing.


3.2 Risk Impact Cost analysis

The amount of money a business is going to invest in its critical engineering and business continuity plans will be representative of the losses that could be incurred in the event of a disaster. “In one case, the cost of a single interruption mounted to over €40 million. The total annual cost of the power interruptions in this company’s case was estimated to be in the region of €88 million”. Clearly this business had not carried out sufficient risk management of its critical engineering to protect against this loss; however, it may have been that the cost of implementing risk mitigation far exceeded the cost of any losses. Before any risk mitigation is carried out, whether for people, processes or technology, a risk impact cost analysis must be carried out.

The risk impact cost benefit analysis will identify the cost of restoring services within a given time frame compared to the financial losses caused by downtime. This will provide details of the maximum cost benefit; however, other points need to be factored in, such as the likelihood of repeat failures and the reputational loss caused by even one failure. These extra factors may mean that a business will invest heavily in risk mitigation to ensure impact costs are minimal, even though this is not the most cost beneficial approach.
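The comparison described above can be sketched as a simple expected loss calculation. The model (failures per year multiplied by hours of downtime and loss per hour) and every figure in the example are illustrative assumptions. Reputational loss, which the text notes must also be considered, is deliberately left out because it is hard to express as a single number.

```python
def expected_annual_loss(failures_per_year, hours_per_failure, loss_per_hour):
    """Expected yearly downtime loss under a simple frequency x duration model."""
    return failures_per_year * hours_per_failure * loss_per_hour


def net_benefit(loss_before, loss_after, annual_mitigation_cost):
    """Positive result: the mitigation saves more than it costs per year."""
    return (loss_before - loss_after) - annual_mitigation_cost


# Hypothetical figures: one failure every two years, 20,000 lost per hour.
before = expected_annual_loss(0.5, 8, 20_000)  # 8 hours to restore today
after = expected_annual_loss(0.5, 1, 20_000)   # 1 hour with mitigation in place

print(net_benefit(before, after, annual_mitigation_cost=30_000))
```

A negative result would indicate that, on cost alone, the mitigation is not justified, which is the situation the paper warns may lie behind some apparently under-protected businesses.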

For a risk impact cost benefit analysis to be useful, the business must have a business continuity plan and completed Critical Engineering Risk Studies to identify the risks. Real time risk information will then enable businesses to produce effective models to ensure their money is spent wisely.

Conclusion

The proposed model of real time risk modelling provides complete transparency of critical systems, compliance and operating procedures for both businesses and maintenance providers. This will provide the visibility, accountability and auditability required for critical environments. The visual risk dependency model will allow businesses to understand the impact on operations of an asset failure, or of the increased risks that may be present during maintenance periods. In addition, transparency of the systems is self-policing and will reduce the resource required for time based audits.


About Authors:

Andrew Dutton
CEM Director
uptimeplus
1290, Aztec West
Almondsbury, Bristol
BS32 4SG
Email: [email protected]
Web: http://uptimeplus.co.uk

Robert Clayton
Senior Critical Infrastructure Manager
uptimeplus
1290, Aztec West
Almondsbury, Bristol
BS32 4SG
Email: [email protected]
Web: http://uptimeplus.co.uk

References

[1] Oxford Dictionary.
[2] Institute of Risk Management - A Risk Management Standard.
[3] http://www.hse.gov.uk - hex1207.pdf - Legionella outbreaks and HSE investigations.
[4] BSCM Operational Risk – Critical Engineering.
[5] Greater London Authority.