brighttalk outage insurance- what you need to know - final
TRANSCRIPT
Outage Insurance: Everything You Need to Know
Mr. White has fifteen years of experience designing and managing the deployment of Systems Monitoring and Event Management software. Prior to joining IBM, Mr. White held various positions including the leader of the Monitoring and Event Management organization of a Fortune 100 company and developing solutions as a consultant for a wide variety of organizations, including the Mexican Secretaría de Hacienda y Crédito Público, Telmex, Wal-Mart of Mexico, JP Morgan Chase, Nationwide Insurance and the US Navy Facilities and Engineering Command.
Andrew White Cloud and Smarter Infrastructure Solution Specialist IBM Corporation
http://weheartit.com/entry/12433848
Ground rules for this session…
• If you can’t tell if I am trying to be funny…
  – GO AHEAD AND LAUGH!
• Feel free to text, tweet, yammer, or whatever to share with the rest of the attendees
• If you have a question, no need to wait until the end. Just interrupt me. Seriously… I don’t mind.
I am here today to share some of what I have learned about
We (IT) sell promises… The value of these promises depends on the customer’s perception that we are willing and able to make good on the promise when the time comes. This perception is affected by the interactions they have with us.
http://www.flickr.com/photos/anneacaso/3693155059/sizes/l/in/photostream/
Objective #1: Users Love Our IT Systems…
Anatomy of an Outage
Components: Corporate LANs & VPNs, Load Balancer, Firewall, Web Servers, Message Queue, zOS CICS, WAS, Database, WAS, Database, zOS MQ, DB2
1. 5:45-ish pm: CICS ABENDs start flooding the console, but not at a level high enough to ticket
2. 6:00-ish pm: MQ flows are interrupted and are alerting in Flow Diagnostics
3. 6:04 pm: Synthetic transactions fail; at 6:14 pm the Ops Center confirms the issue and creates a P0 Incident
4. 6:54 pm: Support teams investigate the interrupted flows and determine it is a “back-end” problem
5. 10:29 pm: Support teams investigate MQ, ultimately rule it out, and decide to reset CICS to resolve the issue
http://www.flickr.com/photos/gregphoto/4881356366/sizes/l/in/photostream/
Bad Experience!!!
http://www.ithakabound.com/wp-content/uploads/2010/02/DC-Snow-men-pushing-car.jpg
Why did this happen?
Why is problem solving hard?
Non-transparency (lack of clarity of the situation)
• commencement opacity
• continuation opacity
Polytely (multiple goals)
• inexpressiveness
• opposition
• transience
Complexity (large numbers of items, interrelations, and decisions)
• enumerability
• connectivity (hierarchy relation, communication relation, allocation relation)
• heterogeneity
Dynamics (time considerations)
• temporal constraints
• temporal sensitivity
• phase effects
• dynamic unpredictability
Boyd’s Loop
Diagram labels:
• Observation: Unfolding Circumstances, Outside Information, Unfolding Interaction With Environment
• Orientation: Cultural Norms, Cognitive Abilities, Knowledge Life Cycle, Prior Wisdom, New Information
• Decision (Hypothesis)
• Action (Test)
• Connections: Feed Forward, Feedback, Implicit Guidance & Control
• Note how observation shapes orientation, shapes decision, shapes action, and in turn is shaped by the feedback and other phenomena coming into our sensing or observing window.
• Also note how the entire “loop” (not just orientation) is an ongoing many-sided implicit cross-referencing process of projection, empathy, correlation, and rejection.
From “The Essence of Winning and Losing,” John R. Boyd, January 1996.
Observe Orient Decide Act
Where the Breakdown Occurs
Observe, Orient, Decide, Act
Current State
Situational Awareness
• Level 1: Perception of Elements in Current Situation
• Level 2: Comprehension of Current Situation
• Level 3: Projection of Future Status
Decision
Performance of Actions
Feedback
Individual Influences
• Goals & Objectives
• Preconceptions
• Expectations
• Abilities
• Experience
• Training
• Cognitive Processes
• Long Term Memory
• Automaticity
Systemic Influences
• System Capability
• Interface Design
• Stress & Workload
• Complexity
• Automation
Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
Incident Life Cycle
Down Time: Detection Time, Response Time, Repair Time, Recovery Time
Milestones: Outage, Detection, Diagnosis, Repair, Recover, Restore
Observe Orient Decide Act
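The incident life cycle splits down time into intervals between milestones. A minimal sketch of that arithmetic follows. It assumes each named interval runs between consecutive milestones (outage to detection, detection to diagnosis, diagnosis to repair, repair to restore); that mapping, the function name `phase_durations`, and the timestamps are illustrative assumptions, not something the slide states.

```python
from datetime import datetime

# Assumed mapping of each named interval to a pair of milestones.
PHASES = [
    ("detection_time", "outage", "detection"),
    ("response_time", "detection", "diagnosis"),
    ("repair_time", "diagnosis", "repair"),
    ("recovery_time", "repair", "restore"),
]


def phase_durations(milestones):
    """Return the minutes spent in each phase, plus the total down time."""
    durations = {}
    for name, start, end in PHASES:
        delta = milestones[end] - milestones[start]
        durations[name] = delta.total_seconds() / 60
    durations["down_time"] = sum(durations.values())
    return durations


# Hypothetical timestamps, for illustration only.
example = {
    "outage": datetime(2020, 1, 1, 17, 45),
    "detection": datetime(2020, 1, 1, 18, 14),
    "diagnosis": datetime(2020, 1, 1, 18, 54),
    "repair": datetime(2020, 1, 1, 22, 29),
    "restore": datetime(2020, 1, 1, 22, 45),
}
result = phase_durations(example)
for phase, minutes in result.items():
    print(f"{phase}: {minutes:.0f} min")
```

Breaking the total into phases like this shows where the time actually went, which is usually more useful than the single down time figure.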
Problem Life Cycle Evaluation
Recognition
Observation
Analysis Solution
Validation
Control
Predictive Modeling Timeline
Point of Observation
Past Behavior
• The observation period used to feed the forecasting models
Future Behavior
• The performance period the model is trying to predict
Predictive models harness the information lost in past data so you can discretely identify situations and react to them quickly.
What Matters Most?
Dr. Lee Goldman
Cook County Hospital, Chicago, IL
§ Is the patient feeling unstable angina?
§ Is there fluid in the patient’s lungs?
§ Is the patient’s systolic blood pressure below 100?
The Goldman Algorithm
Chart: Prediction of Patients Expected to Have a Heart Attack Within 72 Hours (scale 0 to 100), Traditional Techniques vs. Goldman Algorithm
By paying attention to what really matters, Dr. Goldman improved the “false negatives” by 20 percentage points and eliminated the “false positives” altogether.
The Goldman Algorithm
Patient suspected of Acute Cardiac Ischemia
Perform Electrocardiogram (EKG)
Decision: ECG Evidence of Acute Myocardial Infarction (MI)? (Yes / No)
• ST-Segment Elevation ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Pathologic Q Waves in ≥ 2 Contiguous Leads (New or Unknown Age)
Decision: ECG Evidence of Acute Ischemia? (Yes / No)
• ST-Segment Depression ≥ 1mm in ≥ 2 Contiguous Leads (New or Unknown Age), or
• T-Wave Inversion in ≥ 2 Contiguous Leads (New or Unknown Age), or
• Left Bundle-Branch Block (New or Unknown Age)
Decision: Urgent Factors Present? (0 Factors / 1 Factor / 0 or 1 Factors / 2 or 3 Factors)
• Rales Above Both Lung Bases
• Systolic Blood Pressure <100 mm Hg
• Unstable Ischemic Heart Disease
Risk levels: High Risk, Moderate Risk, Low Risk, Very Low Risk
Units: Coronary Care Unit, Inpatient Telemetry Unit, Observation Unit
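The flowchart amounts to a very small decision function: two ECG questions and a count of urgent factors. The sketch below illustrates how little input it needs. The connecting lines of the chart did not survive in this transcript, so the branch-to-risk mapping here is an assumption pieced together from the surviving labels (0 Factors, 1 Factor, 0 or 1 Factors, 2 or 3 Factors). The function name `goldman_risk` is hypothetical, and this is an illustration of the idea, not clinical guidance.

```python
def goldman_risk(ecg_acute_mi, ecg_acute_ischemia, urgent_factors):
    """Return a risk label from two ECG findings and an urgent-factor count.

    The branch layout is an assumed reading of the flattened flowchart.
    """
    if not 0 <= urgent_factors <= 3:
        raise ValueError("urgent_factors must be between 0 and 3")
    if ecg_acute_mi:
        return "High Risk"
    if ecg_acute_ischemia:
        # Assumed: 2 or 3 factors -> high, 0 or 1 factors -> moderate.
        return "High Risk" if urgent_factors >= 2 else "Moderate Risk"
    # Assumed: without ECG evidence, only the factor count matters.
    if urgent_factors >= 2:
        return "Moderate Risk"
    return "Low Risk" if urgent_factors == 1 else "Very Low Risk"


print(goldman_risk(False, True, 1))
print(goldman_risk(False, False, 0))
```

The point for monitoring is the same as for triage: a handful of well-chosen signals can classify a situation faster than a flood of undifferentiated data.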
First…
… we need to talk a little bit about your brain
The Triune Brain
Reptilian Brain (basal ganglia)
Mammalian Brain (limbic system)
Cognitive Brain (neocortex)
Our Thought Process
*** not very reliable
Cognition
Limbic Center (hippocampus and amygdala)
Cortex (hippocampus and amygdala)
Conscious Choice (via motor centers)
Most primitive, seat of unconscious
Long-term memory
Conscious, meaning, choice
Perception (via the senses)***
Pre-Frontal Cortex (hippocampus and amygdala)
Stimulus
Short Term Memory
Your Brain Working Memory Understanding Judgement Relationship
Short-term memory is where the real work of sense-making takes place
Short-term memory has a limited amount of space (The estimate is 7 ± 2)
The big-data dilemma
Chart axes: Time, Quantity
Information the brain can consume
Information is cheap. Understanding is expensive. -Karl Fast, Professor of UX Design, Kent State University
From Data to Wisdom
Data
• Symbols
• Metrics
• Facts
Information
• Patterns
• Comparisons
• Organization
Intelligence
• Decisions
• Skill
• Adaptation
Knowledge
• Trends
• Generalizations
• Beliefs
Wisdom
• Accountability
• Foresight
• Synthesis
Other chart labels: Correlation, Analysis, Application, Understanding, Complexity, Context, Communication, Repetition
x
y
$y_i = \alpha_0 + \alpha_i x_i + \varepsilon_i$
Data
Information
Knowledge
Past Future
Abstract Tangible
Information Intelligence Knowledge Wisdom Data
Knowledge is the point of transition
Why Knowledge?
All You Need
Love
1. Adapted from Endsley, M.R. (1995b). Toward a theory of situation awareness in dynamic systems. Human Factors 37(1), 32–64.
Our systems are capable of producing a huge amount of data, both on the status of their own components and on the status of the environment. The problem with today’s systems is not a lack of information, but finding what is needed when it is needed.
Our success in any endeavor depends directly on our ability to solve problems
What do we need to do that?
You Gotta Have Skillz…!
Common Problem Types
§ Design Problems
§ Creative Problems
§ Daily Problems
§ People Problems
Rule-Based Approach
Event Based Approach
The Problem with the Rules-Based Approach
• Solutions are driven by accepted conventions
• Best practices are coveted and are adopted without understanding how and why they were developed
• There must always be a right answer
• No logical analysis is required
• People are frequently seen as the “root cause”
• The outcomes are enforced using “re-dos” and punitive actions (or the looming threat of these things)
Event-Based Problem Solving
• Appreciative Understanding
• Know What We Are Solving
• Create A Common Reality
• Solutions Based on Causes
The Pre-Mortem Process
Define the Problem
Chart the Causal Relationships and Add Evidence
Identify Solutions
Implement the Solutions
Step 1: Define the Problem
Problem Definition
• What:
• When:
  – Date/Time:
  – Relative: what was happening at the time of this event?
• Where:
  – Specific:
  – Relative: logical dependencies?
• Significance:
  – availability:
  – environment:
  – costs: revenue maintenance? other miscellaneous costs
  – frequency:
Gut Check…
You should be able to answer all of the following:
• Why are we working on this?
• How much time should we spend?
• What people do we need?
• How much money should we spend?
The What Statement
• It is used as “The Primary Effect (PE)”
  – It is a statement of what we want to prevent from happening again
• There may be more than one
  – If they are unrelated, perform separate RCAs
  – If they are related and you can’t decide which to use, pick the one that is nearest to the present time
• Noun/verb statement
Step 2: Add Causal Relationships and Evidence
The T: Drive reached 0 Bytes free
The database stopped processing queries
The application server was timing out
Users were getting 500 errors on the website
Customers called the helpdesk to complain
Add more hard drive space
Have you seen something like this before?
What do we really know?
It’s never that simple
Customers Complaining
Web Server returning 500 errors
The application server was timing out
SQL Server was not processing queries
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed (Lack of Control)
Only one database cluster in use
DR SQL Cluster
DR Cluster being used for UAT testing
More Information Needed
Only one application server exists
More Information Needed
Trying to do business on the website (Desired Condition)
Causes are joined by -AND-
Rules for Causal Relationships
Database Down (Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause)
① Causes are effects, and effects are causes
Rules for Causal Relationships
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
② You can keep identifying causes – there is no limit
Two Important Questions
End of the Universe (Effect)
Database Down (Primary Effect)
Drive Full (Cause/Effect)
Logs Not Truncated (Cause/Effect)
Beginning of Time (Cause)
Ask “Why?”
Ask “What”
Rules for Causal Relationships
③ An Effect is often the result of multiple causes
SQL Server was not processing queries (Effect)
Transaction log was unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation
Space allocations are fixed
Lack of Control
Causes are joined by -AND-
Rules for Causal Relationships
④ Causes need to be both necessary and sufficient
SQL Server was not processing queries (Effect)
Transaction log was unable to grow (Transitory Cause)
T: Drive at 0 Bytes free (Non-Transitory Cause & Effect)
Logs were not truncated (Transitory Cause & Effect)
DBA on honeymoon vacation in Fiji (Transitory Cause)
Logs are truncated manually (Non-Transitory Cause)
Company has only 1 DBA (Non-Transitory Cause)
“Backup” DBA was not aware the logs require truncation (Non-Transitory Cause)
Space allocations are fixed (Non-Transitory Cause)
Lack of Control
Causes are joined by -AND-
How Fire Works
Timeline (Time axis): Oxygen, Heat, Fuel, Match Strike, Fire
Cause types: Transitory, Non-Transitory
Fire: Oxygen -AND- Heat -AND- Fuel -AND- Match Strike
• Transitory Causes act as catalysts to bring about change (think Transition)
• Non-Transitory Causes are objects, properties/attributes, and status
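The causal rules can be sketched as a small data structure: every effect lists the causes that must all be present (the -AND- relationship), and each cause is tagged as transitory or non-transitory. This is a minimal sketch using the fire example. The class name `Cause`, the helper names, and the tagging of Oxygen, Heat, and Fuel as non-transitory conditions are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cause:
    name: str
    kind: str = "non-transitory"  # or "transitory", or "effect" for the top node
    causes: list = field(default_factory=list)  # all joined by AND


def ask_why(effect, depth=0):
    """Walk down the chart, as if asking 'Why?' at every level."""
    lines = ["  " * depth + f"{effect.name} ({effect.kind})"]
    for cause in effect.causes:
        lines.extend(ask_why(cause, depth + 1))
    return lines


def would_occur(effect, present):
    """An effect occurs only if ALL of its causes occur (the AND rule)."""
    if not effect.causes:
        return effect.name in present
    return all(would_occur(cause, present) for cause in effect.causes)


fire = Cause("Fire", "effect", [
    Cause("Oxygen"),
    Cause("Heat"),
    Cause("Fuel"),
    Cause("Match Strike", "transitory"),
])

print("\n".join(ask_why(fire)))
print(would_occur(fire, {"Oxygen", "Heat", "Fuel", "Match Strike"}))
print(would_occur(fire, {"Oxygen", "Heat", "Fuel"}))
```

Because the causes are joined by AND, removing any single one prevents the effect, which is why each cause on the chart is a candidate place for a solution.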
RCA Diagram
Add Evidence
Statistical Data
Situational
Observation
Examples of Evidence
• Personal experience or observation
• Statistical data (Monitoring Metrics)
• Examples, particular events, or situations that illustrate
• Analogies (comparisons with similar situations)
• Informed opinion (the opinions of experts and authorities)
• Historical documentation
• Experimental evidence
Ideas for Finding Causes
Causes
Management
Organization
Process
Knowledge
Technology
People
Information
Applications
Infrastructure
Capital
Step 3: Find Solutions
Failure Modes Analysis
SQL Server Not Available
Transaction log is unable to grow
T: Drive at 0 Bytes free
Logs were not truncated
DBA on honeymoon vacation in Fiji
Logs are truncated manually
Company has only 1 DBA
“Backup” DBA was not aware the logs require truncation (Condition Cause)
Space allocations are fixed (Condition Cause)
Lack of Control
SQL is unable to cache query results
Available RAM at 0 Bytes Free
C: Drive at 0 Bytes free
Minidump is configured to write to C: Drive
Server was ASRing frequently
Software distributions were leaving files in the TEMP folder
%TEMP% configured to C:\Temp
Kernel able to write to page file
Branches are joined by -AND- and -OR-
Picking Monitors
Monitor the intersections at the “ORs”
At least one point along each branch after the “OR”
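The monitor-picking rule can be written as a short tree walk: flag every node where branches join with an OR, then flag one point on each branch beneath it. The sketch below is a minimal illustration. The tree shape is an assumption (the transcript preserves the box labels but not the connecting lines), the `Node` class and `pick_monitors` function are hypothetical names, and choosing the first node of each branch is just one way to satisfy "at least one point".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    gate: str = "AND"  # how this node's children combine: "AND" or "OR"
    children: list = field(default_factory=list)


def pick_monitors(node):
    """Return monitor points: each OR intersection plus one point per branch."""
    picks = []
    if node.children and node.gate == "OR":
        picks.append(node.name)  # the OR intersection itself
        for child in node.children:
            picks.append(child.name)  # one point on each branch after the OR
    for child in node.children:
        for name in pick_monitors(child):
            if name not in picks:
                picks.append(name)
    return picks


# Assumed tree shape, built from labels on the slide.
tree = Node("SQL Server Not Available", "OR", [
    Node("Transaction log is unable to grow", "AND", [
        Node("T: Drive at 0 Bytes free"),
        Node("Space allocations are fixed"),
    ]),
    Node("SQL is unable to cache query results", "AND", [
        Node("Available RAM at 0 Bytes Free"),
        Node("C: Drive at 0 Bytes free"),
    ]),
])

monitors = pick_monitors(tree)
print(monitors)
```

Monitoring one point per branch is what lets an operator tell which path led to the shared effect, rather than only knowing that the effect happened.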
FMEA Matrix (Impact Calculation)
Impact (severity)
• Negligible (1-2): no loss in functionality, mostly cosmetic
• Marginal (3-4): temporary interruptions or the degradation lasts for a brief period of time
• Critical (5-6): the problem will not resolve itself but a workaround exists allowing the problem to be bypassed
• Serious (7-8): the problem will not resolve itself and no workaround is possible. Functionality is impaired or lost but the system is usable to some extent
• Catastrophic (9-10): the system is completely unusable
Frequency (occurrence)
• Improbable (1-2): less than 1 time per year
• Remote (3-4): 1 time per year
• Occasional (5-6): 1 time per month
• Probable (7-8): 1 time per day
• Chronic (9-10): 1 or more times per day
Detection (when the problem would be found)
• Very high (1-2): during the design phase
• High (3-4): during peer review or unit testing
• Moderate (5-6): during system testing or acceptance testing
• Remote (7-8): during or immediately after production deployment
• Very Remote (9-10): only after heavy usage by users
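The slide gives three 1-10 scales but does not say how they combine. A common FMEA convention is to multiply them into a Risk Priority Number (RPN); the sketch below assumes that convention. The function name `risk_priority_number` and the example scores are hypothetical.

```python
def risk_priority_number(severity, occurrence, detection):
    """Multiply the three 1-10 scores (conventional FMEA RPN, assumed here)."""
    for label, score in (("severity", severity),
                         ("occurrence", occurrence),
                         ("detection", detection)):
        if not 1 <= score <= 10:
            raise ValueError(f"{label} must be between 1 and 10")
    return severity * occurrence * detection


# Hypothetical scores for two failure modes, for illustration only.
failure_modes = {
    "T: Drive at 0 Bytes free": (9, 3, 8),
    "C: Drive at 0 Bytes free": (7, 5, 4),
}

ranked = sorted(
    ((name, risk_priority_number(*scores)) for name, scores in failure_modes.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, rpn in ranked:
    print(f"{rpn:4d}  {name}")
```

Ranking by a single number gives a defensible order for deciding which failure modes get monitoring investment first.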
FMEA Matrix (Evidence)
These are the events that help us to RULE IN a failure mode as a possible cause
These are the events that help us RULE OUT the failure mode as not relevant
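Rule-in and rule-out evidence can be applied mechanically once it is documented for each failure mode. The following is a minimal sketch. The event names, the dictionary layout, and the choice to let rule-out evidence take precedence over rule-in evidence are all assumptions made for illustration.

```python
def triage_failure_modes(failure_modes, observed_events):
    """Sort failure modes into ruled-in, ruled-out, and undetermined."""
    observed = set(observed_events)
    result = {"ruled_in": [], "ruled_out": [], "undetermined": []}
    for mode in failure_modes:
        if observed & set(mode["rule_out"]):
            # Assumed design choice: any rule-out event wins.
            result["ruled_out"].append(mode["name"])
        elif observed & set(mode["rule_in"]):
            result["ruled_in"].append(mode["name"])
        else:
            result["undetermined"].append(mode["name"])
    return result


# Hypothetical failure modes and event names, for illustration only.
failure_modes = [
    {"name": "Transaction log is unable to grow",
     "rule_in": ["t_drive_free_space_low"],
     "rule_out": ["log_backup_succeeded"]},
    {"name": "SQL is unable to cache query results",
     "rule_in": ["available_ram_low"],
     "rule_out": ["ram_headroom_ok"]},
    {"name": "C: Drive at 0 Bytes free",
     "rule_in": ["c_drive_free_space_low"],
     "rule_out": ["c_drive_free_space_ok"]},
]

triage = triage_failure_modes(
    failure_modes, ["t_drive_free_space_low", "ram_headroom_ok"]
)
print(triage)
```

The undetermined list is the useful output during an incident: it is the set of failure modes that still need someone assigned to investigate.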
Application-Technology Matrix
Maps services, applications and technologies enabling:
• Monitoring investment prioritization
• Monitoring maturity
• Which templates need to be deployed when new hardware is acquired
• Whether a service has sufficient monitoring coverage based on its application components
• This approach allows for anticipating changes to a customer’s monitoring needs
Scores indicate:
0 – No Strategy
1 – Limited Monitoring
2 – Fully Integrated Strategy
Step 4: Use this knowledge intelligently
During Service Support
• Command Centers and Support Teams
  – Use the failure modes to rule out causes
  – Each failure mode will have a documented process to follow to mitigate the impact once the likely failure mode is identified
• Incident Managers
  – Start bridge calls and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
  – Coordinate the investigation assignments and consolidate the investigation results
Facilitating Production Assurance
• CritSits
  – Start the CritSit meeting and provide an accounting of all the potential failure modes, which have been successfully ruled out, and which need to be investigated
  – Initiate investigations / experiments by assigning potential failure modes to the incident response teams
• Problem Management
  – Document the causal elements as new failure modes
  – Disseminate new failure modes to Architecture, the Monitoring Team, and the Command Center/Service Desk
• Reporting
  – Produce a monthly newsletter to application owners with the list of failure modes they should discuss with their architects
  – Incorporate failure modes into “Fault Line” analysis
During the Design Process
• Architects
  – Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
  – Document the requirements for Solution Architects to follow to ensure the mitigation strategies are implemented
• Developers
  – Certify that designs do not contain the known failure modes or document that the failure mode does not present an unacceptable risk
  – Certify the designs implement the mitigation strategies
Improving Enterprise Processes and Tools
• Systems Management and Monitoring
  – Develop new monitoring requirements using the documented indications and contraindications
• Event Management
  – Develop new correlations tying indications and contraindications to failure modes to assist in ruling out or ruling in those “in play” more efficiently
• Configuration Management
  – Develop new discovery patterns using the documented indications and contraindications
  – Develop automations to detect the presence of failure mode conditions and generate an event to the Event Management System
A few final thoughts…
Running a Good Pre-Mortem
• Defer Judgment
• Encourage Wild Ideas
• Build on Ideas
• Stay Focused
• One Person at a Time
• Be Visual
• Go for Quantity
SUCCESSFUL RCA
Here is Why It Works
RCA Process
Re-Establishes
Personal Relationships
Social Networks
Cooling-Off Period
De-Escalating Gestures
Confidence-Building Measures
Trust Building
Respect
Don’t try to create everything at once. Knowledge is something that is created over time.
Iterative Development
Let’s keep the conversation going…
ReverendDrew
SystemsManagementZen.Wordpress.com
systemsmanagementzen.wordpress.com/feed/
@SystemsMgmtZen
ReverendDrew
614-306-3434