Outage Insurance: Everything You Need to Know

Upload: andrew-white

Post on 12-May-2015

TRANSCRIPT

Page 1: Brighttalk   outage insurance- what you need to know - final

Outage Insurance: Everything You Need to Know

Page 2: Brighttalk   outage insurance- what you need to know - final

Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions including the leader of the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command.

Andrew White
Cloud and Smarter Infrastructure Solution Specialist
IBM Corporation

Page 3: Brighttalk   outage insurance- what you need to know - final

http://weheartit.com/entry/12433848

Page 4: Brighttalk   outage insurance- what you need to know - final

Ground rules for this session…

•  If you can’t tell if I am trying to be funny…
   –  GO AHEAD AND LAUGH!

•  Feel free to text, tweet, yammer, or whatever to share with the rest of the attendees

•  If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.

Page 5: Brighttalk   outage insurance- what you need to know - final

I am here today to share some of what I have learned about

Page 6: Brighttalk   outage insurance- what you need to know - final

We (IT) sell promises… The value of these promises depends on the customer’s perception that we are willing and able to make good on the promise when the time comes. This perception is affected by the interactions they have with us.

Page 7: Brighttalk   outage insurance- what you need to know - final

http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/

Objective #1: Users Love Our IT Systems…

Page 8: Brighttalk   outage insurance- what you need to know - final

Anatomy of an Outage

[Diagram components: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, WAS and Database (shown twice), zOS CICS, zOS MQ, DB2]

1. 5:45-ish pm: CICS ABENDs start flooding the console, but not high enough to ticket

2. 6:00-ish pm: MQ flows are interrupted and start alerting in Flow Diagnostics

3. 6:04pm: Synthetic transactions fail; at 6:14 the Ops Center confirms the issue and creates a P0 Incident

4. 6:54pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem

5. 10:29pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
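To make the cost of the slow diagnosis concrete, here is a minimal sketch that measures how long each step of the timeline above took from the first symptom. It assumes the "-ish" times are exact and that everything happened on the same evening; the event labels are shortened from the slide.

```python
from datetime import datetime

# Timestamps from the slide; the "-ish" times are treated as exact (an assumption).
TIMELINE = [
    ("17:45", "CICS ABENDs start flooding the console"),
    ("18:00", "MQ flows are interrupted"),
    ("18:04", "Synthetic transactions fail"),
    ("18:14", "Ops Center creates the P0 Incident"),
    ("18:54", "Support teams call it a back-end problem"),
    ("22:29", "Support teams decide to reset CICS"),
]


def minutes_between(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def elapsed_from_first(timeline):
    """Pair each event with the minutes elapsed since the first event."""
    first = timeline[0][0]
    return [(minutes_between(first, stamp), label) for stamp, label in timeline]


for minutes, label in elapsed_from_first(TIMELINE):
    print(f"+{minutes:>3} min  {label}")
```

Read this way, the incident is opened about half an hour after the first ABENDs, and the bulk of the evening is spent deciding what to fix rather than fixing it.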

Page 9: Brighttalk   outage insurance- what you need to know - final

http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/

Bad Experience!!!

Page 10: Brighttalk   outage insurance- what you need to know - final

http://www.ithakabound.com/wp-content/uploads/2010/02/DC-Snow-men-pushing-car.jpg

Why did this happen?

Page 11: Brighttalk   outage insurance- what you need to know - final

Why is problem solving hard?

Non-transparency (lack of clarity of the situation)
• commencement opacity
• continuation opacity

Polytely (multiple goals)
• inexpressiveness
• opposition
• transience

Complexity (large numbers of items, interrelations, and decisions)
• enumerability
• connectivity (hierarchy relation, communication relation, allocation relation)
• heterogeneity

Dynamics (time considerations)
• temporal constraints
• temporal sensitivity
• phase effects
• dynamic unpredictability

Page 12: Brighttalk   outage insurance- what you need to know - final

Boyd’s Loop

[Diagram: Observation (fed by Outside Information, Unfolding Circumstances, Unfolding Interaction With Environment, and Implicit Guidance & Control) feeds forward to orientation (Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information), which feeds forward to Decision (Hypothesis), which feeds forward to Action (Test). Feedback loops return to Observation.]

•  Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.

•  Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.

From “The Essence of Winning and Losing,” John R. Boyd, January 1996.

Observe Orient Decide Act

Page 13: Brighttalk   outage insurance- what you need to know - final

Where the Breakdown Occurs

Observe, Orient, Decide, Act

[Diagram: Current State feeds Situational Awareness (Level 1: Perception of Elements in Current Situation; Level 2: Comprehension of Current Situation; Level 3: Projection of Future Status), followed by Decision and Performance of Actions, with Feedback to the Current State.]

Systemic Influences
• System Capability
• Interface Design
• Stress & Workload
• Complexity
• Automation

Individual Influences
• Goals & Objectives
• Preconceptions
• Expectations
• Abilities
• Experience
• Training
• Long Term Memory, Automaticity, Cognitive Processes

Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

Page 14: Brighttalk   outage insurance- what you need to know - final

Incident Life Cycle

Down Time: Detection Time, Response Time, Repair Time, Recovery Time

Milestones: Outage, Detection, Diagnosis, Repair, Recover, Restore

Observe Orient Decide Act

Page 15: Brighttalk   outage insurance- what you need to know - final

Problem Life Cycle

Stages shown: Evaluation, Recognition, Observation, Analysis, Solution, Validation, Control

Page 16: Brighttalk   outage insurance- what you need to know - final

Predictive Modeling Timeline

Point of Observation

Past Behavior
• The observation period used to feed the forecasting models

Future Behavior
• The performance period the model is trying to predict

Page 17: Brighttalk   outage insurance- what you need to know - final

Predictive models harness the information lost in past data so you can discretely identify situations and react to them quickly.
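As a rough illustration of the timeline on the previous slide, the sketch below splits a series at the point of observation and projects the past trend into the performance period. The free-space readings are hypothetical, and the straight-line projection is only a stand-in for whatever forecasting model is actually used.

```python
def split_at_observation_point(samples, point_of_observation):
    """Split (time, value) samples into the observation and performance periods."""
    observation = [s for s in samples if s[0] <= point_of_observation]
    performance = [s for s in samples if s[0] > point_of_observation]
    return observation, performance


def average_step(observation):
    """Average change in value per unit of time across the observation period."""
    first, last = observation[0], observation[-1]
    return (last[1] - first[1]) / (last[0] - first[0])


# Hypothetical hourly readings: (hour, GB free on a drive).
readings = [(0, 50), (1, 45), (2, 40), (3, 35), (4, 30), (5, 25)]

past, future = split_at_observation_point(readings, point_of_observation=3)
slope = average_step(past)

# Project the observed trend forward from the last observed sample.
last_time, last_value = past[-1]
predictions = [(t, last_value + slope * (t - last_time)) for t, _ in future]

for (t, predicted), (_, actual) in zip(predictions, future):
    print(f"hour {t}: predicted {predicted:.1f} GB, actual {actual} GB")
```

Only the samples at or before the point of observation are allowed to shape the forecast; the later samples exist purely to check it.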

Page 18: Brighttalk   outage insurance- what you need to know - final

What Matters Most?

Dr. Lee Goldman

Cook County Hospital, Chicago, IL

§  Is the patient feeling unstable angina?
§  Is there fluid in the patient’s lungs?
§  Is the patient’s systolic blood pressure below 100?

The Goldman Algorithm

[Bar chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours, Traditional Techniques vs. Goldman Algorithm, on a scale of 0 to 100]

By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.

Page 19: Brighttalk   outage insurance- what you need to know - final

The Goldman Algorithm

[Flowchart]

Patient suspected of Acute Cardiac Ischemia: Perform Electrocardiogram (EKG)

ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No) ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)

ECG Evidence of Acute Ischemia? (Yes / No) ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age) or T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age) or Left Bundle-Branch Block (New or Unknown Age)

Urgent Factors Present? (asked on both branches) Rales Above Both Lung Bases; Systolic Blood Pressure <100 mm Hg; Unstable Ischemic Heart Disease

Branches by number of urgent factors: 0 Factors, 1 Factor, 2 or 3 Factors; 0 or 1 Factors, 2 or 3 Factors

Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk

Destinations: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
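The point of the flowchart is how few inputs the decision needs. The sketch below shows that shape in code. The flattened slide does not preserve which branch leads to which risk level, so the mapping used here is an assumed reading of the chart and is offered only to illustrate the structure of a small decision tree.

```python
def count_urgent_factors(rales_above_both_bases: bool,
                         systolic_bp_below_100: bool,
                         unstable_ischemic_heart_disease: bool) -> int:
    """Count how many of the three urgent factors are present."""
    return sum([rales_above_both_bases,
                systolic_bp_below_100,
                unstable_ischemic_heart_disease])


def goldman_risk(ecg_acute_mi: bool, ecg_acute_ischemia: bool,
                 urgent_factors: int) -> str:
    """Return a risk level; the branch-to-level mapping is an assumption."""
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")
    if ecg_acute_mi:
        return "High Risk"
    if ecg_acute_ischemia:
        # Assumed: 2 or 3 factors is high risk, 0 or 1 is moderate.
        return "High Risk" if urgent_factors >= 2 else "Moderate Risk"
    # Assumed: no ECG evidence, graded by the factor count alone.
    if urgent_factors >= 2:
        return "Moderate Risk"
    if urgent_factors == 1:
        return "Low Risk"
    return "Very Low Risk"


factors = count_urgent_factors(True, False, True)
print(goldman_risk(ecg_acute_mi=False, ecg_acute_ischemia=True,
                   urgent_factors=factors))
```

Two ECG questions and a count of three factors are the whole input. The same discipline, deciding which handful of signals actually matter, is what the rest of this session applies to outages.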

Page 20: Brighttalk   outage insurance- what you need to know - final

First…

… we need to talk a little bit about your brain

Page 21: Brighttalk   outage insurance- what you need to know - final

The Triune Brain

Reptilian Brain (basal ganglia)

Mammalian Brain (limbic system)

Cognitive Brain (neocortex)

Page 22: Brighttalk   outage insurance- what you need to know - final

Our Thought Process

*** not very reliable

Cognition

Limbic Center (hippocampus and amygdala)

Cortex (hippocampus and amygdala)

Conscious Choice (via motor centers)

Most primitive, seat of unconscious

Long-term memory

Conscious, meaning, choice

Perception (via the senses)***

Pre-Frontal Cortex (hippocampus and amygdala)

Stimulus

Page 23: Brighttalk   outage insurance- what you need to know - final

Short Term Memory

Your Brain Working Memory Understanding Judgement Relationship

Short-term memory is where the real work of sense-making takes place

Short-term memory has a limited amount of space (The estimate is 7 ± 2)

Page 24: Brighttalk   outage insurance- what you need to know - final

The big-data dilemma

[Chart: Quantity over Time, with a line marking the information the brain can consume]

Page 25: Brighttalk   outage insurance- what you need to know - final

Information is cheap. Understanding is expensive. -Karl Fast, Professor of UX Design, Kent State University

Page 26: Brighttalk   outage insurance- what you need to know - final

From Data to Wisdom

Data: Symbols, Metrics, Facts
Information: Patterns, Comparisons, Organization
Knowledge: Trends, Generalizations, Beliefs
Intelligence: Decisions, Skill, Adaptation
Wisdom: Accountability, Foresight, Synthesis

Transitions: Correlation, Analysis, Application, Understanding
Axes: Complexity, Context
Other labels: Communication, Repetition

Page 27: Brighttalk   outage insurance- what you need to know - final

[Scatter plot of y against x with a fitted line, annotated Data, Information, Knowledge]

$y_i = \alpha_0 + \alpha_i x_i + \varepsilon_i$
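One way to read this slide is that the raw points are the data, the fitted line is the information, and the ability to predict a new point is the knowledge. A minimal sketch of that progression follows, using ordinary least squares with a single slope coefficient. The single slope and the sample points are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Data: hypothetical observations.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]

# Information: the relationship extracted from the data.
intercept, slope = fit_line(xs, ys)

# Knowledge: using that relationship to say something about an unseen point.
prediction = intercept + slope * 10
print(intercept, slope, prediction)
```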

Page 28: Brighttalk   outage insurance- what you need to know - final

Why Knowledge?

[Diagram: Data, Information, Knowledge, Intelligence, and Wisdom plotted against two axes, Past / Future and Abstract / Tangible]

Knowledge is the point of transition

Page 29: Brighttalk   outage insurance- what you need to know - final

All You Need

Love

Page 30: Brighttalk   outage insurance- what you need to know - final

1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.

Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed.

Page 31: Brighttalk   outage insurance- what you need to know - final

Our success in any endeavor depends directly on our ability to solve problems

What do we need to do that?

Page 32: Brighttalk   outage insurance- what you need to know - final

You Gotta Have Skillz…!

Page 33: Brighttalk   outage insurance- what you need to know - final

Common Problem Types
§  Design Problems
§  Creative Problems
§  Daily Problems
§  People Problems

Rule-Based Approach

Event-Based Approach

Page 34: Brighttalk   outage insurance- what you need to know - final

The Problem with the Rules-Based Approach
•  Solutions are driven by accepted conventions
•  Best practices are coveted and are adopted without understanding how and why they were developed
•  There must always be a right answer
•  No logical analysis is required
•  People are frequently seen as the “root cause”
•  The outcomes are enforced using “re-dos” and punitive actions (or the looming threat of these things)

Page 35: Brighttalk   outage insurance- what you need to know - final

Event-Based Problem Solving
•  Appreciative Understanding
•  Know What We Are Solving
•  Create A Common Reality
•  Solutions Based on Causes

Page 36: Brighttalk   outage insurance- what you need to know - final
Page 37: Brighttalk   outage insurance- what you need to know - final

The Pre-Mortem Process

Define the Problem

Chart the Causal Relationships and Add Evidence

Identify Solutions

Implement the Solutions

Page 38: Brighttalk   outage insurance- what you need to know - final

Step 1: Define the Problem

Page 39: Brighttalk   outage insurance- what you need to know - final

Problem Definition
•  What:
•  When:
   Date/Time:
   Relative: what was happening at the time of this event?
•  Where:
   Specific:
   Relative: logical dependencies?
•  Significance:
   availability:
   environment:
   costs:
   revenue
   maintenance?
   other miscellaneous costs
   frequency:
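The template above can be captured as a small record so that an incomplete definition is easy to spot before the analysis starts. This is only a sketch; the field names are hypothetical renderings of the headings on the slide, and the example values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemDefinition:
    """One problem statement; field names mirror the slide's headings."""
    what: str
    when_datetime: str = ""
    when_relative: str = ""   # what was happening at the time of the event
    where_specific: str = ""
    where_relative: str = ""  # logical dependencies
    significance: dict = field(default_factory=dict)  # availability, costs, frequency...

    def missing_fields(self):
        """List the parts of the definition that are still blank."""
        names = ["what", "when_datetime", "when_relative",
                 "where_specific", "where_relative"]
        missing = [name for name in names if not getattr(self, name)]
        if not self.significance:
            missing.append("significance")
        return missing


definition = ProblemDefinition(
    what="Database stopped processing queries",
    where_specific="T: drive on the SQL Server host",
)
print(definition.missing_fields())
```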

Page 40: Brighttalk   outage insurance- what you need to know - final

Gut Check…

You should be able to answer all of the following:

•  Why are we working on this?
•  How much time should we spend?
•  What people do we need?
•  How much money should we spend?

Page 41: Brighttalk   outage insurance- what you need to know - final

The What Statement
•  It is used as “The Primary Effect (PE)”
   –  It is a statement of what we want to prevent from happening again
•  There may be more than one
   –  If they are unrelated, perform separate RCAs
   –  If they are related and you can’t decide which to use, pick the one that is nearest to the present time
•  Noun/verb statement

Page 42: Brighttalk   outage insurance- what you need to know - final

Step 2: Add Causal Relationships and Evidence

Page 43: Brighttalk   outage insurance- what you need to know - final

The T: Drive reached 0 Bytes free

The database stopped processing queries

The application server was timing out

Users were getting 500 errors on the website

Customers called the helpdesk to complain

Add more hard drive space

Have you seen something like this before?

What do we really know?

Page 44: Brighttalk   outage insurance- what you need to know - final

It’s never that simple

[Causal chart; the boxes are joined by seven -AND- connectors]

Customers Complaining
Web Server returning 500 errors
The application server was timing out
SQL Server was not processing queries
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
Only one database cluster in use
DR SQL Cluster
DR Cluster being used for UAT testing
More Information Needed
Only one application server exists
More Information Needed
Trying to do business on the website (Desired Condition)

Page 45: Brighttalk   outage insurance- what you need to know - final

Rules for Causal Relationships

Database Down (Effect)

Drive Full (Cause/Effect)

Logs Not Truncated (Cause)

①  Causes are effects, and effects are causes

Page 46: Brighttalk   outage insurance- what you need to know - final

Rules for Causal Relationships

End of the Universe (Effect)

Database Down (Primary Effect)

Drive Full (Cause/Effect)

Logs Not Truncated (Cause/Effect)

Beginning of Time (Cause)

②  You can keep identifying causes – there is no limit

Page 47: Brighttalk   outage insurance- what you need to know - final

Two Important Questions

End of the Universe (Effect)

Database Down (Primary Effect)

Drive Full (Cause/Effect)

Logs Not Truncated (Cause/Effect)

Beginning of Time (Cause)

Ask “Why?”

Ask “What”
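The two questions amount to walking the same chain in opposite directions. The sketch below stores the chain from the slide as a simple lookup and follows it either way. Reading "What" as "what did this cause?" is an interpretation, and the function names are hypothetical.

```python
# Each effect maps to the cause one step behind it (taken from the slide).
CAUSED_BY = {
    "Database Down": "Drive Full",
    "Drive Full": "Logs Not Truncated",
}


def ask_why(effect, caused_by):
    """Follow the chain from an effect back toward its causes."""
    chain = [effect]
    while chain[-1] in caused_by:
        chain.append(caused_by[chain[-1]])
    return chain


def ask_what(cause, caused_by):
    """Follow the chain from a cause forward toward its effects."""
    leads_to = {cause_: effect for effect, cause_ in caused_by.items()}
    return ask_why(cause, leads_to)


print(ask_why("Database Down", CAUSED_BY))
print(ask_what("Logs Not Truncated", CAUSED_BY))
```

The walk stops only because the recorded chain runs out, which is the second rule in practice: where to stop asking is a choice made by the analyst, not something the chain decides.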

Page 48: Brighttalk   outage insurance- what you need to know - final

Rules for Causal Relationships

③  An Effect is often the result of multiple causes

SQL Server was not processing queries (Effect)
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control

[The boxes are joined by three -AND- connectors]

Page 49: Brighttalk   outage insurance- what you need to know - final

Rules for Causal Relationships

④  Causes need to be both necessary and sufficient

SQL Server was not processing queries (Effect)
Transaction log was unable to grow (Transitory Cause)
T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
Logs were not truncated (Transitory Cause & Effect)
DBA on honeymoon vacation in Fiji (Transitory Cause)
Logs are truncated manually (Non-Transitory Cause)
Company has only 1 DBA (Non-Transitory Cause)
“Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
Space allocations are fixed (Non-Transitory Cause)
Lack of Control

[The boxes are joined by three -AND- connectors]
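A set of causes can be tested against what was actually observed. In the sketch below, a candidate set is sufficient if the effect appeared every time all of its causes were present, and necessary if all of its causes were present every time the effect appeared. The three observations are hypothetical and the cause names are shortened from the chart.

```python
def is_sufficient(causes, observations):
    """True if the effect occurred whenever every cause in the set was present."""
    return all(effect for present, effect in observations if causes <= present)


def is_necessary(causes, observations):
    """True if every cause in the set was present whenever the effect occurred."""
    return all(causes <= present for present, effect in observations if effect)


# Hypothetical history: (conditions present, were the logs left untruncated?)
OBSERVATIONS = [
    ({"manual truncation", "DBA away", "backup DBA unaware"}, True),
    ({"manual truncation", "DBA away"}, False),  # the backup DBA knew what to do
    ({"manual truncation"}, False),              # the DBA was in the office
]

partial = {"manual truncation", "DBA away"}
complete = {"manual truncation", "DBA away", "backup DBA unaware"}

print(is_sufficient(partial, OBSERVATIONS), is_necessary(partial, OBSERVATIONS))
print(is_sufficient(complete, OBSERVATIONS), is_necessary(complete, OBSERVATIONS))
```

The partial set passes the necessity check but fails sufficiency, which is the signal that a cause is still missing from the chart.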

Page 50: Brighttalk   outage insurance- what you need to know - final

How Fire Works

[Timeline diagram: Oxygen, Heat, Fuel, and the Match Strike lead to Fire; causes are labeled Transitory or Non-Transitory]

Fire: Oxygen -AND- Heat -AND- Fuel -AND- Match Strike

•  Transitory Causes act as catalysts to bring about change (think Transition)

•  Non-Transitory Causes are objects, properties/attributes, and status
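The fire example can be replayed as a short simulation: the standing conditions can be in place for a long time, and nothing happens until the transitory cause arrives. Which causes count as transitory is inferred here from the two definitions above, and the three time steps are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cause:
    name: str
    transitory: bool  # True for a catalyst, False for an object, property, or status


# Classification inferred from the slide's definitions (an assumption).
FIRE_CAUSES = [
    Cause("Oxygen", False),
    Cause("Heat", False),
    Cause("Fuel", False),
    Cause("Match Strike", True),
]


def first_occurrence(timeline, causes):
    """Return the first time step at which every cause is present, else None."""
    required = {cause.name for cause in causes}
    for step, present in enumerate(timeline):
        if required <= present:
            return step
    return None


# Hypothetical timeline: the conditions accumulate, then the match is struck.
timeline = [
    {"Oxygen", "Fuel"},
    {"Oxygen", "Fuel", "Heat"},
    {"Oxygen", "Fuel", "Heat", "Match Strike"},
]

transitory = [cause.name for cause in FIRE_CAUSES if cause.transitory]
print(first_occurrence(timeline, FIRE_CAUSES), transitory)
```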

Page 51: Brighttalk   outage insurance- what you need to know - final

RCA Diagram

[Diagram: the same causal chart shown on Page 44, from “Customers Complaining” down to “Trying to do business on the website (Desired Condition)”, with its seven -AND- connectors]

Page 52: Brighttalk   outage insurance- what you need to know - final

Add Evidence

[Diagram: the same causal chart shown on Page 44, now with evidence attached to the causes. Evidence labels shown: Statistical Data, Situational Observation]

Page 53: Brighttalk   outage insurance- what you need to know - final

Examples of Evidence
•  Personal experience or observation
•  Statistical data (Monitoring Metrics)
•  Examples, particular events, or situations that illustrate
•  Analogies (comparisons with similar situations)
•  Informed opinion (the opinions of experts and authorities)
•  Historical documentation
•  Experimental evidence
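Attaching evidence to each cause makes it obvious which boxes on the chart are still guesses. The sketch below flags any cause with nothing behind it as needing more information. The evidence entries are hypothetical, and the type names are shortened from the list above.

```python
EVIDENCE_TYPES = {
    "personal observation", "statistical data", "example", "analogy",
    "informed opinion", "historical documentation", "experimental evidence",
}


def needs_more_information(causes, evidence):
    """Return the causes that have no evidence attached."""
    return [cause for cause in causes if not evidence.get(cause)]


def unknown_types(evidence):
    """Return any evidence types that are not on the list."""
    used = {kind for items in evidence.values() for kind, _ in items}
    return sorted(used - EVIDENCE_TYPES)


causes = [
    "T: Drive at 0 Bytes free",
    "Logs were not truncated",
    "Only one application server exists",
]

# Hypothetical evidence: cause -> list of (type, description).
evidence = {
    "T: Drive at 0 Bytes free": [("statistical data", "disk space metric")],
    "Logs were not truncated": [("personal observation", "report from the DBA team")],
}

print(needs_more_information(causes, evidence))
print(unknown_types(evidence))
```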

Page 54: Brighttalk   outage insurance- what you need to know - final

Ideas for Finding Causes

Causes

Management

Organization

Process

Knowledge

Technology

People

Information

Applications

Infrastructure

Capital

Page 55: Brighttalk   outage insurance- what you need to know - final

Step 3: Find Solutions

Page 56: Brighttalk   outage insurance- what you need to know - final

Failure Modes Analysis

[Failure mode tree; the boxes are joined by five -AND- and two -OR- connectors]

SQL Server Not Available
Transaction log is unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation (Condition Cause)
Space allocations are fixed (Condition Cause)
Lack of Control
SQL is unable to cache query results
Available RAM at 0 Bytes Free
C: Drive at 0 Bytes free
Minidump is configured to write to C: Drive
Server was ASRing frequently
Software distributions were leaving files in the TEMP folder
%TEMP% configured to C:\Temp
Kernel able to write to page file

Page 57: Brighttalk   outage insurance- what you need to know - final

Picking Monitors

[Diagram: the same failure mode tree shown on Page 56, with the monitoring points called out]

Monitor the intersections at the “OR’s”

At least one point along each branch after the “OR”
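The two rules for picking monitors can be applied mechanically once the tree is written down. The sketch below walks a small tree, selecting every OR intersection and the first node on each branch beneath it. The tree shape is an assumed, simplified reading of the diagram, and `Node` and `pick_monitors` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "LEAF"  # "AND", "OR", or "LEAF"
    children: list = field(default_factory=list)


def pick_monitors(node, picked=None):
    """Collect each OR intersection plus the first node on every branch below it."""
    if picked is None:
        picked = []
    if node.gate == "OR":
        for name in [node.name] + [child.name for child in node.children]:
            if name not in picked:
                picked.append(name)
    for child in node.children:
        pick_monitors(child, picked)
    return picked


# Assumed shape: two alternative branches under the top-level failure.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Space allocations are fixed"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
    ]),
])

monitors = pick_monitors(tree)
print(monitors)
```

A monitor on each branch after the OR is what lets the command center tell the branches apart when the top-level alarm fires.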

Page 58: Brighttalk   outage insurance- what you need to know - final

FMEA Matrix (Impact Calculation)

Severity of impact
•  Negligible (1-2): no loss in functionality, mostly cosmetic
•  Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
•  Critical (5-6): the problem will not resolve itself but a work around exists allowing the problem to be bypassed
•  Serious (7-8): the problem will not resolve itself and no work around is possible. Functionality is impaired or lost but the system is usable to some extent
•  Catastrophic (9-10): the system is completely unusable

Frequency of occurrence
•  Improbable (1-2): less than 1 time per year
•  Remote (3-4): 1 time per year
•  Occasional (5-6): 1 time per month
•  Probable (7-8): 1 time per day
•  Chronic (9-10): 1 or more times per day

Likelihood of detection (when the problem would be found)
•  Very high (1-2): during the design phase
•  High (3-4): during peer review or unit testing
•  Moderate (5-6): during system testing or acceptance testing
•  Remote (7-8): during or immediately after production deployment
•  Very Remote (9-10): only after heavy usage by users
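The slide gives the three 1-10 scales but not how they are combined. A common FMEA convention is to multiply them into a risk priority number, and the sketch below assumes that convention. The failure modes are taken from the earlier tree, but the scores assigned to them are hypothetical.

```python
def risk_priority_number(severity, occurrence, detection):
    """Multiply the three 1-10 scores (a common FMEA convention, assumed here)."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("each score must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical scores: (failure mode, severity, occurrence, detection).
FAILURE_MODES = [
    ("C: Drive at 0 Bytes free", 7, 3, 7),
    ("T: Drive at 0 Bytes free", 9, 5, 7),
]

ranked = sorted(
    ((risk_priority_number(s, o, d), name) for name, s, o, d in FAILURE_MODES),
    reverse=True,
)

for rpn, name in ranked:
    print(f"{rpn:>4}  {name}")
```

Ranking by the product gives a defensible order for deciding which failure modes get monitoring and mitigation first.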

Page 59: Brighttalk   outage insurance- what you need to know - final

FMEA Matrix (Evidence)

These are the events that help us to RULE IN a failure mode as a possible cause

These are the events that help us RULE OUT the failure mode as not relevant
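Once each failure mode lists its rule-in and rule-out events, triage during an incident becomes a lookup. The sketch below sorts failure modes into ruled in, ruled out, or still needing investigation. The event names are hypothetical, and letting a rule-out event win over a rule-in event is a design assumption.

```python
# Hypothetical rule-in and rule-out events for two failure modes.
FAILURE_MODES = {
    "Transaction log is unable to grow": {
        "rule_in": {"T: drive at 0 bytes free"},
        "rule_out": {"transaction log grew in the last hour"},
    },
    "SQL is unable to cache query results": {
        "rule_in": {"available RAM at 0 bytes"},
        "rule_out": {"page file writes succeeding"},
    },
}


def triage(failure_modes, observed):
    """Classify each failure mode from the set of observed events."""
    status = {}
    for name, events in failure_modes.items():
        if events["rule_out"] & observed:
            status[name] = "ruled out"  # assumed: a rule-out event takes precedence
        elif events["rule_in"] & observed:
            status[name] = "ruled in"
        else:
            status[name] = "needs investigation"
    return status


observed = {"T: drive at 0 bytes free", "page file writes succeeding"}
for name, state in triage(FAILURE_MODES, observed).items():
    print(f"{state:>20}: {name}")
```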

Page 60: Brighttalk   outage insurance- what you need to know - final

Application-Technology Matrix

Maps services, applications and technologies enabling:
•  Monitoring investment prioritization
•  Monitoring maturity
•  Which templates need to be deployed when new hardware is acquired
•  Whether a service has sufficient monitoring coverage based on its application components
•  This approach allows for anticipating changes to a customer’s monitoring needs

Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
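A matrix like this can answer the coverage question directly. The sketch below looks up the score for each technology behind a service and reports the ones that fall short. The service, its technologies, their scores, and the threshold for "sufficient" are all hypothetical.

```python
SCORE_LABELS = {
    0: "No Strategy",
    1: "Limited Monitoring",
    2: "Fully Integrated Strategy",
}

# Hypothetical matrix contents.
TECH_SCORES = {"SQL Server": 2, "WebSphere": 1, "MQ": 0}
SERVICE_TECH = {"Online Store": ["SQL Server", "WebSphere", "MQ"]}


def coverage_gaps(service, service_tech, tech_scores, minimum=2):
    """Return the technologies behind a service that score below the minimum."""
    return [tech for tech in service_tech[service]
            if tech_scores.get(tech, 0) < minimum]


def has_sufficient_coverage(service, service_tech, tech_scores, minimum=2):
    """A service is covered when none of its technologies fall short."""
    return not coverage_gaps(service, service_tech, tech_scores, minimum)


for tech in coverage_gaps("Online Store", SERVICE_TECH, TECH_SCORES):
    print(f"{tech}: {SCORE_LABELS[TECH_SCORES[tech]]}")
```

A technology missing from the matrix is treated as a score of 0 here, so a newly acquired platform shows up as a gap until a template is deployed for it.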

Page 61: Brighttalk   outage insurance- what you need to know - final

Step 4: Use this knowledge intelligently

Page 62: Brighttalk   outage insurance- what you need to know - final

During Service Support
•  Command Centers and Support Teams
   –  Use the failure modes to rule out causes
   –  Each failure mode will have a documented process to follow to mitigate the impact once the likely failure mode is identified
•  Incident Managers
   –  Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
   –  Coordinate the investigation assignments and consolidate the investigation results

Page 63: Brighttalk   outage insurance- what you need to know - final

Facilitating Production Assurance
•  CritSits
   –  Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
   –  Initiate investigations / experiments by assigning potential failure modes to the incident response teams
•  Problem Management
   –  Document the causal elements as new failure modes
   –  Disseminate new failure modes to Architecture, the Monitoring Team, and the Command Center/Service Desk
•  Reporting
   –  Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects
   –  Incorporate failure modes into “Fault Line” analysis

Page 64: Brighttalk   outage insurance- what you need to know - final

During the Design Process
•  Architects
   –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
   –  Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented
•  Developers
   –  Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
   –  Certify the designs implement the mitigation strategies

Page 65: Brighttalk   outage insurance- what you need to know - final

Improving Enterprise Processes and Tools
•  Systems Management and Monitoring
   –  Develop new monitoring requirements using the documented indications and contraindications
•  Event Management
   –  Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently
•  Configuration Management
   –  Develop new discovery patterns using the documented indications and contraindications
   –  Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System

Page 66: Brighttalk   outage insurance- what you need to know - final

A few final thoughts…

Page 67: Brighttalk   outage insurance- what you need to know - final

Running a Good Pre-Mortem
•  Defer Judgment
•  Encourage Wild Ideas
•  Build on Ideas
•  Stay Focused
•  One Person at a Time
•  Be Visual
•  Go for Quantity

SUCCESSFUL RCA

Page 68: Brighttalk   outage insurance- what you need to know - final

Here is Why It Works

RCA Process

Re-Establishes

Personal Relationships

Social Networks

Cooling-Off Period

De-Escalating Gestures

Confidence-Building Measures

Trust Building

Respect

Page 69: Brighttalk   outage insurance- what you need to know - final

Don’t try to create everything at once. Knowledge is something that is created over time.

Iterative Development

Page 70: Brighttalk   outage insurance- what you need to know - final

Let’s keep the conversation going…

[email protected]

ReverendDrew

SystemsManagementZen.Wordpress.com

systemsmanagementzen.wordpress.com/feed/

@SystemsMgmtZen

ReverendDrew

[email protected]

614-306-3434