Do you understand the impact of failure in your critical engineering infrastructure?
A risk-based approach.
List of Contents
Summary
1.1 Introduction
1.2 Risk Management
2 Critical Engineering Risk Studies
2.1 Compliance Risk Studies
3 Business Continuity
3.1 Real Time Risk Monitoring Tools
3.2 Risk Impact Cost analysis
Conclusion
References
Summary
This paper addresses the failure of companies to fully understand and mitigate the risk arising from
the critical engineering infrastructure, processes and resource controls that support their businesses.
Failing to understand those risks, or to take the correct actions to remove or reduce them, can
result in a high cost to the business.
Uptimeplus proposes a Risk Model that combines traditional risk management techniques
with real time risk status software for managing the critical engineering infrastructure. It
incorporates the three key elements of People, Process and Critical Infrastructure as
follows:
A site specific critical infrastructure visual risk dependency model, which uses live feeds
and workflow streams to provide the real time status of both operational and capacity risks
to the critical infrastructure, whether electrical or mechanical. It can be accessed from any PC,
with a bespoke dashboard available from any mobile device, giving data centre managers a
real time view of operational risk.
A site specific compliance visual risk dependency model, which uses workflow streams
with automated date monitoring and escalation processes. It can be accessed from
any PC, with a bespoke dashboard available from any mobile device.
A site specific uptimeplus processes visual risk dependency model, which tracks the
implementation of uptimeplus CEM processes and provides automated date monitoring
and escalation processes. It can be accessed from any PC, with a bespoke dashboard
available from any mobile device.
1.1 Introduction
Failure to adequately identify and manage risks can have a devastating reputational and
financial impact, as has been aptly demonstrated within the finance industry in recent
years. While risk management processes are widely available and used within business
organisations of all sizes, it is unusual for them to be adequately
documented and implemented for critical systems engineering, where two main issues arise.
Firstly, it can be unclear to the managing teams what impact the failure of an asset or a
process will have on its dependants and ultimately on business operations. Secondly, failures occur
and are either not reported or are reported without sufficient clarification of the risk to
dependants and ultimately to business operations.
This situation arises from the wide range of experts employed in the construction
and management of a critical systems environment, and from the often incorrect assumption that all
risks have been mitigated during the design and construction phase. While this is less
prevalent in the datacentre environment, it is more common among smaller businesses running
their own critical infrastructures. Smaller critical environments are often designed and constructed
by suitably qualified teams and then handed over to a building/facilities manager for day to day
management. It is unusual to find a building/facilities manager who has been trained across the
full range of disciplines involved, i.e. mechanical and electrical engineering, IT engineering, risk
management and facilities management. It is not the aim of this document to detract from the role
of the building/facilities manager, but it is clear that a manager who is not technically trained is
heavily reliant on the process documentation supplied during construction, and on the incumbent
maintenance suppliers, to identify and manage risks. Both the physical engineering systems and the
human systems supporting them must be evaluated to ensure the total system meets the business need,
with clear accountability and a full audit trail of issues raised and resolutions made.
The current financial crisis within the UK has resulted in businesses and organisations of all sizes
looking to reduce operating costs, including the cost of maintaining and operating mechanical and
electrical (M&E) assets. This has produced a very competitive M&E maintenance environment in which
maintenance companies look to reduce costs through multiskilling and by reducing the number of time
based maintenance tasks, often without implementing a predictive maintenance scheme
to detect potential asset failures. While this is sufficient for general building maintenance, a
site engineer is unlikely to fully understand when the failure of an asset has actually put
the business at risk, and may not even report the failure, especially if a standby unit has started
and kept systems operational.
Taking all the above into consideration, it is clear that any person responsible for a critical
environment must have robust operational and reporting processes in place, together with a clear
line of sight of the impact of an asset’s failure on its dependants.
1.2 Risk Management
The Oxford Dictionary defines risk management as “The forecasting and evaluation of financial
risks together with the identification of procedures to avoid or minimize their impact”. [1] A robust
risk management process will identify each risk and evaluate its impact on business operations and
assets in conjunction with its probability. It will also identify mitigating controls to reduce or
remove the risk, and provide some form of monitoring to ensure that the necessary actions and
resolutions are implemented and recorded. Before a risk can be identified, its key drivers need to
be understood, as shown in Fig 1.
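As a minimal sketch of evaluating impact in conjunction with probability, the Python below ranks
the entries of a hypothetical risk register by a simple score (probability multiplied by impact).
The 1 to 5 scales, the field names and the example risks are assumptions made for illustration;
they are not taken from any standard or from the uptimeplus model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Risk:
    name: str
    probability: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int       # assumed scale: 1 (negligible) to 5 (severe)
    mitigation: str = ""  # empty string means no control recorded yet

    @property
    def score(self) -> int:
        # Simple probability x impact score; other weightings are possible.
        return self.probability * self.impact


def rank_risks(risks):
    """Return the risks sorted from highest to lowest score."""
    return sorted(risks, key=lambda r: r.score, reverse=True)


# Hypothetical register entries, for illustration only.
register = [
    Risk("Standby generator fails to start", 2, 5, "Monthly on-load test"),
    Risk("Single chiller serving comms room", 3, 4),
    Risk("Missed emergency light test", 4, 2, "Automated reminder"),
]

ranked = rank_risks(register)
for risk in ranked:
    flag = "" if risk.mitigation else "  (no mitigation recorded)"
    print(f"{risk.score:>2}  {risk.name}{flag}")
```

Keeping the mitigation alongside the score makes it easy to see which high scoring risks still
have no recorded control.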
This paper concentrates on operational risks related to the M&E critical systems and the
internal information systems that are required to identify, assess and report any risk to business
operations. The two recognised approaches to identifying risks are the quantitative and the
qualitative. Qualitative risk assessment is generally considered to be a very straightforward
process based on judgement, requiring no specialist skills or complicated techniques.
Fig 1. Risk Drivers [2]
Risk assessment of critical systems engineering will be quantitative: a numerical estimate is
made of the probability that a defined harm will result from the occurrence of a particular event.
Various methods are used to determine the numerical value, including the following:
Comparative Methods
    Checklists
    Audits
Fundamental Methods
    Deviation Analysis
    Hazard and Operability Studies
    Energy Analysis
    Failure Modes & Effects Analysis
Failure Logic
    Fault Trees
    Event Trees
    Cause-Consequence diagrams
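To illustrate how one of the failure logic methods listed above yields a numerical estimate, the
sketch below evaluates a very small fault tree. It assumes the basic events are independent, and
the probabilities are invented for illustration rather than measured. The tree structure (loss of
critical power if the UPS faults, or if mains is lost and the generator fails to start) is also a
hypothetical example.

```python
from math import prod


def and_gate(probabilities):
    """All inputs must occur: multiply (assumes independent events)."""
    return prod(probabilities)


def or_gate(probabilities):
    """Any one input is enough: 1 minus the chance that none occur."""
    return 1 - prod(1 - p for p in probabilities)


# Illustrative annual probabilities (assumed, not measured data).
p_mains_loss = 0.10
p_generator_fails_to_start = 0.05
p_ups_fault = 0.01

# Intermediate event: the supply is lost only if mains AND generator both fail.
p_supply_lost = and_gate([p_mains_loss, p_generator_fails_to_start])

# Top event: the critical load loses power if the UPS faults OR the supply is lost.
p_top = or_gate([p_ups_fault, p_supply_lost])

print(f"P(supply lost)            = {p_supply_lost:.4f}")
print(f"P(loss of critical power) = {p_top:.5f}")
```

In this example the single UPS contributes more to the top event than the mains and generator
combined, which is the kind of insight a fault tree is intended to surface.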
Once risks have been identified and evaluated, an action plan should be created and reviewed
before implementation, typically by asking:
Will the revised controls lead to tolerable risk levels?
Are new hazards created?
Has the most cost-effective solution been chosen?
What do people affected think about the need for, and practicality of, the revised
preventive measures?
Will the revised controls be used in practice, and not ignored in the face of, for
example, pressures to get the job done?
A variety of software packages on the market support qualitative and quantitative
risk assessment. These provide a way of quantifying and managing risk, ensuring that any
identified mitigation procedures and processes are implemented and that their implementation, or
otherwise, is recorded. However, these systems can give a building manager a false sense of
security, especially where critical engineering systems are concerned. The typical modus operandi
is that the risk assessment of the critical systems is made, processes and procedures are
implemented, and then a long interval passes before the risks and procedures are
reviewed, if they are reviewed at all.
Risk assessment should be seen as a continuing process. The adequacy of control
measures should therefore be subject to continual review and revised if necessary.
2 Critical Engineering Risk Studies
The first step towards disaster tolerance is risk avoidance, and the way to avoid risk in critical
engineering is to identify and remove or mitigate single points of failure. In a critical engineering
risk study the three key elements are Technology, People and Process, and a Single Point of Failure
study should be carried out on all three. To ensure these studies provide an accurate
assessment, the following key points should be understood and clarified with the client before the
survey starts.
1. List of critical areas and supporting services needed to maintain business operations.
2. Original design intent of the critical engineering system.
3. Number of staff needed to maintain business operations.
4. External IT equipment and links required to maintain operations.
5. Client’s business continuity plans and timescales before they are implemented.
6. Cost of implementing the business continuity plan.
7. Value of the loss caused by loss of business operations.
Unless all of the above are clearly understood, there is a real danger that the Single Point of
Failure study will identify risks and recommend mitigation measures
that are actually unjustified when compared with the value of the lost business operations or the
cost of implementing the business continuity plan. The single points of failure survey will review
the following areas to identify the impact of failure on dependent assets or business operations.
Internal
    Standby Power systems
    Power
    Cabling
    Cooling
    Segregation of Critical systems
    Fire Suppression & Detection
    Flood prevention
    Training
    Personnel
    Emergency operating procedures
External
    Supply Power
    Flood Risks
    Security
    Transport links
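For the technology element of the survey areas above, the search for single points of failure can
be sketched as a simple graph check: remove each asset in turn and test whether the critical load
can still be fed from any source. The asset names and the layout below are hypothetical, and the
sketch assumes that any one intact supply path is enough to keep the load running; a real survey
would also cover people, process and capacity.

```python
def reachable(start, feeds, removed):
    """Assets reachable from start along feed links, skipping removed assets."""
    if start in removed:
        return set()
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for nxt in feeds.get(node, []):
            if nxt not in removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def single_points_of_failure(feeds, sources, load):
    """Assets whose loss alone cuts every path from the sources to the load."""
    assets = set(feeds) | {a for targets in feeds.values() for a in targets}
    spofs = []
    for asset in sorted(assets - {load}):
        removed = {asset}
        still_fed = any(load in reachable(s, feeds, removed) for s in sources)
        if not still_fed:
            spofs.append(asset)
    return spofs


# Hypothetical site: "A": ["B"] means A feeds B, so B depends on A.
feeds = {
    "mains": ["switchboard"],
    "generator": ["switchboard"],
    "switchboard": ["ups_a", "ups_b"],
    "ups_a": ["server_room"],
    "ups_b": ["server_room"],
}

spofs = single_points_of_failure(
    feeds, sources=["mains", "generator"], load="server_room"
)
print(spofs)
```

In this layout the duplicated supplies and UPS units are not single points of failure, but the
shared switchboard is, because every path to the load passes through it.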
Carrying out single points of failure surveys is common practice within the industry, and there is
no doubt that a completed survey gives a building manager an understanding of the risks and
what is required to remove or mitigate them.
What it does not provide is a real time view of the actual risk to the systems when an asset fails,
so that the manager can correctly evaluate the possible impact and decide what actions need to be
carried out. Depending on the quality of processes and personnel, the building manager is often
left unaware that an asset has failed which could, in time, affect business operations.
2.1 Compliance Risk Studies
For businesses and organisations, failure to comply with statutory regulations can cause both
financial and reputational loss, as well as carrying the risk of prosecution. The UK is heavily
regulated, and ensuring compliance through planned maintenance, continual monitoring and
completion of identified actions places a heavy burden on resources. Regulated tasks include:
Periodic Electric Review
PAT Testing
Fire Certification
Fire Alarm Testing
Emergency Light Testing
Fire Extinguishers
Fire Risk Assessment
Boiler Certification (oil, gas & LPG)
Landlords Gas Safety Certificate
Energy Performance Certification
Asbestos Surveys
Air Conditioning Servicing
Lightning Protection Equipment
The Health and Safety Laboratory, an agency of the Health and Safety Executive (HSE), was tasked by
HSE’s Legionella Committee in September 2011 to gather data on outbreaks of Legionnaires’
disease in Great Britain. This was completed for the 10 year period to August 2011, to identify the
relationship with a range of factors.
Fig 2 shows that 63% of the enforcements were due to legionella outbreaks on
hot and cold water systems. Building managers who fail to understand and control the risk of
regulatory compliance tasks will find themselves not only part of the statistics but also, depending
on the size of the impact, in the headlines!
As previously identified, there are a number of software packages for managing risk and compliance.
However, very few businesses and organisations invest the time and money to ensure they are
operated effectively, so time based audits routinely find either missed inspection
dates or corrective actions not completed. It has become accepted that time based audits will
always find issues, and that they are a way of checking on incumbent maintenance providers and
pushing them to get tasks completed.
Fig 2. Legionella Enforcements [3]
3 Business Continuity
Business Continuity Management (BCM) is the process of planning to ensure that your business
can return to "business as usual" as quickly and painlessly as possible in the event of a major
disruption. “Around half of all businesses experiencing a disaster with no effective plans for
recovery fail within the following 12 months” [4]. Businesses and organisations have a range of
software packages and consulting companies to assist them with devising and implementing a
business continuity plan but they all use the following basic planning and implementation steps
for ensuring business continuity.
Step 1: Analyse your business
Step 2: Assess the risks
Step 3: Plan and prepare
Step 4: Communicate your plan
Step 5: Test your plan
To ensure that a business continuity plan is effective it must be tested. Unless full
testing is completed, documented and assessed, a business will never fully understand whether its
contingency planning is sufficient to mitigate disaster. An effective, tested business continuity
management plan also provides insight into the level of resilience required
of the critical engineering infrastructure. Taking the following two extreme cases, clearly
Business 2 will require a far more resilient infrastructure:
Business 1 - Provides finance solutions to
businesses on a software platform that has 5
mirrored servers in five countries;
business operations will only be impacted if
all 5 servers are down at the same time.
Business 2 - Provides finance solutions to
businesses on a software platform that has 1
server in 1 country.
However, if Business 1 has decided, because of its operating model, that it does not need resilient
infrastructure but has not fully tested that it can operate using only one server, then clearly it
is leaving itself at risk. A business that relies on mirror sites as part of its contingency
plans must ensure its testing is effective and complete, and for each site would have
to carry out the following:
Shut down IT servers
Remove all power to the property
Disconnect all data-links to the property
Very few firms can demonstrate that they have gone to these lengths to simulate the loss of an
entire building, often choosing software data transfer and testing as an alternative. Businesses
must have a global view of their operations, understand the risks across their entire portfolio,
and have a means of identifying the impact on resilience as failures occur.
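The contrast between Business 1 and Business 2 earlier in this section can be put into rough
numbers. The sketch below assumes each site has the same chance of being down at a given moment
and that sites fail independently, with an optional common cause (for example a shared software
fault) that takes every site down together. All of the figures are illustrative assumptions, not
data from either business.

```python
def chance_all_down(p_site_down, sites, p_common_cause=0.0):
    """Probability that every site is down at the same time.

    Assumes sites fail independently, apart from an optional common cause
    that takes all of them down at once.
    """
    independent = p_site_down ** sites
    return p_common_cause + (1 - p_common_cause) * independent


# Assumed chance that any one site is down at a given moment.
p_site = 0.01

business_1 = chance_all_down(p_site, sites=5)
business_2 = chance_all_down(p_site, sites=1)
business_1_shared = chance_all_down(p_site, sites=5, p_common_cause=0.001)

print(f"Business 1, independent sites: {business_1:.2e}")
print(f"Business 1, with common cause: {business_1_shared:.2e}")
print(f"Business 2, single site:       {business_2:.2e}")
```

Even a small common cause dominates the result for Business 1, which is why the independence of
mirror sites needs to be demonstrated by full testing rather than assumed.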
3.1 Real Time Risk Monitoring Tools
Being able to identify key risk issues and illustrate these clearly and concisely to colleagues and
business leaders, who are often non-technical, is a key requirement in the decision making and
management process [5]. Key factors in the success of critical engineering environments are:
Visibility
Transparency
Accountability
Auditability
The ability to communicate quickly and accurately
It is uptimeplus’s proposal that, for businesses to fully understand their risk across a range of
systems, real time monitoring and modelling systems are the way forward.
Critical Systems
Linking live status information from critical engineering systems to a visual risk dependency
model will provide the building manager with accurate real time information on the
operating status of the plant. In addition, the visual risk dependency model will provide a
clear indication of the risk created by the failed assets and ultimately the risk to business
operations. Having this information available to key staff will ensure that consensus is quickly
reached on the correct course of action, if any, to mitigate or remove the risk, whether that means
changes to the M&E systems themselves or moving critical workflows to other sites.
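One way to picture such a dependency model is as a graph in which each asset lists its suppliers
and how many of them it needs in order to run. The sketch below propagates a set of failed assets
through that graph and labels each asset "ok", "at risk" (still running, but with no spare
remaining) or "down". The asset names, the three status labels and the rules are assumptions for
illustration, not the uptimeplus model, and the graph is assumed to contain no loops.

```python
def assess(assets, failed):
    """Return a status for every asset given the set of failed assets.

    assets maps a name to (suppliers, needed): the assets it depends on and
    how many of them must be running for it to operate.
    """
    status = {}

    def visit(name):
        if name in status:
            return status[name]
        suppliers, needed = assets[name]
        if name in failed:
            result = "down"
        else:
            states = [visit(s) for s in suppliers]
            running = sum(s != "down" for s in states)
            healthy = sum(s == "ok" for s in states)
            if running < needed:
                result = "down"      # not enough suppliers left
            elif running == needed and len(suppliers) > needed:
                result = "at risk"   # designed redundancy has been used up
            elif healthy < needed:
                result = "at risk"   # relying on a supplier that is itself at risk
            else:
                result = "ok"
        status[name] = result
        return result

    for name in assets:
        visit(name)
    return status


# Hypothetical site: two chillers and two UPS units, one of each needed.
assets = {
    "chiller_1": ([], 0),
    "chiller_2": ([], 0),
    "ups_a": ([], 0),
    "ups_b": ([], 0),
    "cooling": (["chiller_1", "chiller_2"], 1),
    "power": (["ups_a", "ups_b"], 1),
    "server_room": (["cooling", "power"], 2),
}

status = assess(assets, failed={"chiller_2"})
for name in ("cooling", "power", "server_room"):
    print(f"{name:<12} {status[name]}")
```

With one chiller failed the server room keeps running, but the model reports it as at risk
because a second chiller failure would take it down. This is the situation the paper describes in
which a standby unit masks a failure that nobody reports.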
Compliance
Providing businesses with real time visual data for regulatory compliance, coupled with workflow
systems that automatically issue reminders of inspection and testing dates, will reduce the
need for frequent time based audits and so reduce the resource required.
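The automated date monitoring described above can be sketched as a check of each task's next due
date against today's date. The 30 day warning window, the status names and the example dates are
assumptions for illustration; a real workflow system would also record who is notified and when
an overdue task is escalated.

```python
from datetime import date, timedelta


def compliance_status(due, today, warn_days=30):
    """Classify a task by its due date: 'overdue', 'due soon' or 'ok'."""
    if due < today:
        return "overdue"
    if due - today <= timedelta(days=warn_days):
        return "due soon"
    return "ok"


def escalation_list(tasks, today, warn_days=30):
    """Return (task, status) pairs needing a reminder, earliest due date first."""
    ordered = sorted(tasks.items(), key=lambda item: item[1])
    flagged = [
        (name, compliance_status(due, today, warn_days)) for name, due in ordered
    ]
    return [(name, status) for name, status in flagged if status != "ok"]


# Hypothetical next-due dates for three regulated tasks.
tasks = {
    "Fire alarm testing": date(2024, 3, 1),
    "Emergency light testing": date(2024, 4, 10),
    "Fire risk assessment": date(2024, 9, 30),
}

reminders = escalation_list(tasks, today=date(2024, 3, 20))
for name, status in reminders:
    print(f"{status.upper():<9} {name}")
```

Run daily, a check of this kind surfaces missed inspection dates as they occur instead of leaving
them to be found at the next time based audit.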
Operating Procedures
To prevent an incident escalating from a risk to a disaster, standard and emergency
operating procedures must be in place and in use. Standard Operating Procedures
reduce the risk of an incident occurring by providing forward planning of staffing resource, staff
training and technical operation of the critical systems. Emergency Operating Procedures
ensure that when an incident does occur the correct action is taken by the onsite
teams to prevent a disaster. Providing businesses with real time visual data on the status
of all operating procedures will give them assurance that the required procedures are in place,
and will also give them access to those procedures so they can familiarise themselves with
emergency requirements.
Global View of Business Operations
Providing a global view of the real time risk levels across an entire business portfolio will ensure
that appropriate decisions are made about any implementation that may compound an
identified risk. If a single critical site is at risk and this is immediately highlighted,
understanding what has caused the risk a) helps ensure that it is not repeated at other sites
and b) gives you the opportunity to stop scheduled work that may impact your contingency.
Real time risk monitoring tools with a clear visual indication of status will give
businesses confidence that their risk has been removed or reduced to an acceptable level,
and will also provide the visibility, transparency, accountability and auditability required for
critical environments. Because the information is globally available, it will increase the ability
to communicate quickly and accurately, so that the correct decisions can be made when an incident
occurs. Ultimately this would reduce the resource required by both the business and its support
staff, as the system would be self-policing.
3.2 Risk Impact Cost analysis
The amount of money a business invests in its critical engineering and business
continuity plans will reflect the losses that could be incurred in the event of a
disaster. “In one case, the cost of a single interruption mounted to over €40 million. The
total annual cost of the power interruptions in this company’s case was estimated to be in
the region of €88 million”. Clearly this business had not carried out sufficient risk management
of its critical engineering to protect against this loss; however, it may have been that the cost
of implementing risk mitigation far exceeded the cost of any losses. Before any risk mitigation is
undertaken, whether for people, processes or technology, a risk impact cost analysis must be
carried out.
The risk impact cost benefit analysis will identify the cost of restoring services within a given
time frame compared with the financial losses caused by downtime. This will provide you with
details of the maximum cost benefit; however, other points need to be factored in, such as the
likelihood of repeat failures and the reputational loss caused by even one failure. These extra
factors may mean that a business will invest heavily in risk mitigation to ensure impact costs are
minimal, even though this is not the most cost beneficial approach.
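A minimal sketch of this comparison is given below. It estimates the expected annual loss from
downtime before and after a mitigation that shortens restoration time, and sets the loss avoided
against the annual cost of the mitigation. The function names and every figure are illustrative
assumptions; they are not drawn from the case quoted above, and reputational loss is deliberately
left out because it is hard to express as an hourly figure.

```python
def expected_annual_loss(failures_per_year, hours_per_failure, loss_per_hour):
    """Expected yearly financial loss caused by downtime."""
    return failures_per_year * hours_per_failure * loss_per_hour


def net_annual_benefit(loss_before, loss_after, annual_mitigation_cost):
    """Loss avoided each year minus what the mitigation costs each year."""
    return (loss_before - loss_after) - annual_mitigation_cost


# Illustrative figures only (assumed): one failure every two years,
# restoration time cut from 8 hours to 1 hour by the mitigation.
before = expected_annual_loss(
    failures_per_year=0.5, hours_per_failure=8, loss_per_hour=50_000
)
after = expected_annual_loss(
    failures_per_year=0.5, hours_per_failure=1, loss_per_hour=50_000
)
benefit = net_annual_benefit(before, after, annual_mitigation_cost=120_000)

print(f"Expected annual loss before: {before:,.0f}")
print(f"Expected annual loss after:  {after:,.0f}")
print(f"Net annual benefit:          {benefit:,.0f}")
```

A negative net benefit would indicate that the mitigation costs more than the losses it avoids,
although a business may still choose to proceed once repeat failures and reputation are weighed.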
For a risk impact cost benefit analysis to be useful, the business must have a business continuity
plan and completed Critical Engineering Risk Studies to identify the risks. Real time risk
information will then enable businesses to produce effective models to ensure their money is spent
wisely.
Conclusion
The proposed model of real time risk modelling provides complete transparency of critical
systems, compliance and operating procedures for both businesses and maintenance providers.
This will provide the visibility, accountability and auditability required for critical
environments. The visual risk dependency model will allow all businesses to understand the impact
on operations of an asset failure, and any increased risk present during
maintenance periods. In addition, transparency of the systems is self-policing and will reduce the
resource required for time based audits.
About Authors:
Andrew Dutton
CEM Director
uptimeplus
1290, Aztec West
Almondsbury, Bristol
BS32 4SG
Email: [email protected]
Web: http://uptimeplus.co.uk

Robert Clayton
Senior Critical Infrastructure Manager
uptimeplus
1290, Aztec West
Almondsbury, Bristol
BS32 4SG
Email: [email protected]
Web: http://uptimeplus.co.uk
References
[1] Oxford Dictionary.
[2] Institute of Risk Management - A Risk Management Standard.
[3] http://www.hse.gov.uk - hex1207.pdf - Legionella outbreaks and HSE investigations.
[4] BSCM Operational Risk – Critical Engineering.
[5] Greater London Authority.