using data science to automate event correlation - june 2016 - dan turchin - bigpanda
TRANSCRIPT
REDUCING ALERT NOISE WITH DATA SCIENCE
PREPARED FOR THE FUTURE STATE OF OPERATIONS MEETUP
DAN TURCHIN | @DTURCHIN | BIGPANDA | JUNE 2016
OBJECTIVES
1. Discuss why data’s eating the world
2. Share how data science is solving the noisy alert problem
3. Discuss the state of innovation… and our role in it
4. Learn from each other
DATA IS EATING THE WORLD
DATA SCIENCE
Using all available data to make better
business decisions.
MACHINE LEARNING
Automating the use of statistics to infer
future behavior from past results.
DATA SCIENCE + MACHINE LEARNING: CASE STUDY
WHY DON’T UPS TRUCKS MAKE LEFT TURNS?
• Fuel efficiency
• Maintenance records
• Accident reports
• Driver health data
• On-time deliveries
• Package returns
• Customer surveys
• Objective: improve service and reduce costs
• Hypotheses: minimize miles traveled, avoid rush hour
• Collect and analyze data
• Conclusion: only right turns!
…AND IT OPS DESERVES CREDIT (AND BLAME)
JAMES TURNBULL, THE ART OF MONITORING
“Applications and services are now critical for customer satisfaction. IT is no longer
just a cost center. There are more hosts, applications and infrastructure are more
complex, and expectations around availability and quality are more aggressive. More
data is needed to deliver the same quality of service and often that data isn’t being collected
or is hard to find. Legacy approaches to monitoring no longer work.”
THE STATE OF MONITORING… IS POOR
• 80% AGREE THAT MONITORING IS STRATEGIC.
• 12% ARE SATISFIED WITH THEIR STRATEGY.
• 75% RECEIVE MORE THAN 50 ALERTS PER DAY.
• 31% OF THOSE WITH MTTR GREATER THAN 24 HOURS ARE SATISFIED WITH THEIR MONITORING STRATEGIES… VS. 63% WITH LOWER MTTR.
http://bit.ly/BP_SoM
[Chart: alerts per month (0 to 18,000) by year, 2000 to 2020]
… AND NOISE LEVELS ARE INCREASING
…BUT HEADCOUNT ISN’T
2000
• 5 incidents per engineer per day
• 96 minutes per incident
2020
• 400 incidents per engineer per day
• 1.2 minutes per incident
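Both columns imply the same daily time budget: 5 × 96 and 400 × 1.2 each come to 480 minutes. A minimal sketch of that arithmetic, assuming a fixed 480-minute (eight-hour) working day; the figure is inferred from the numbers above, not stated on the slide:

```python
# Assumption: each engineer has a fixed 480-minute (8-hour) working day,
# inferred from 5 * 96 == 400 * 1.2 == 480.
WORKDAY_MINUTES = 480


def minutes_per_incident(incidents_per_day: float) -> float:
    """Minutes available per incident if the whole day goes to incidents."""
    return WORKDAY_MINUTES / incidents_per_day


for year, incidents in [(2000, 5), (2020, 400)]:
    print(year, incidents, minutes_per_incident(incidents))
```

With headcount held constant, an 80-fold rise in incidents cuts the time available for each one by the same factor.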
WHAT’S THE BEST WAY TO AUTOMATE EVENT CORRELATION?
HEURISTICS NLP
ASSISTED UNASSISTED ASSISTED UNASSISTED
• Optimal for: dynamic models where new inputs affect outputs
• Examples: Air pollution, InfoSec, IT Ops
• Optimal for: static models where known inputs have predictable outputs
• Examples: Migration patterns, Molecular sequencing, Mine detection
COR·RE·LA·TION (ˌkôrəˈlāSH(ə)n)
The extent to which two variables have a linear relationship.
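One common way to quantify a linear relationship is the Pearson correlation coefficient. The slide does not name a specific measure, so that choice is an assumption, and the alert counts below are hypothetical. A minimal sketch:

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute alert counts from two monitoring sources.
cpu_alerts = [1, 2, 3, 4, 5]
query_alerts = [2, 4, 6, 8, 10]
print(pearson(cpu_alerts, query_alerts))
```

A value near 1 or -1 indicates a strong linear relationship and a value near 0 indicates none. Events can still be related (for example, by sharing a cluster) without being correlated in this statistical sense.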
ARE THESE EVENTS RELATED… OR CORRELATED?
[Diagram: events plotted on a 10-minute timeline]
HEURISTICS-BASED CORRELATION
ARE THESE EVENTS RELATED… OR CORRELATED?
“WHENEVER THERE’S A CPU ISSUE IT’S
FOLLOWED BY A QUERY ERROR AND A DISK
I/O ISSUE WITHIN 5 MINUTES WHEN HOSTS
ARE IN THE SAME CLUSTER.”
• CPU SPIKE
• LONG QUERY EXECUTION
• DISK I/O BUFFER
• SAME CLUSTER
HEURISTICS-BASED CORRELATION
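The quoted heuristic can be sketched as a small matcher: for each CPU event, look for a query event and a disk I/O event from the same cluster within five minutes. The `Event` type, the event-kind labels, and the sample data are hypothetical and only illustrate the idea; this is not BigPanda's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    kind: str       # hypothetical labels: "cpu", "query", "disk_io"
    cluster: str
    time: datetime


WINDOW = timedelta(minutes=5)


def correlated_groups(events):
    """For each CPU event, collect query and disk I/O events from the same
    cluster that occur within WINDOW after it. A group is kept only if
    both follow-up kinds are present."""
    events = sorted(events, key=lambda e: e.time)
    groups = []
    for cpu in (e for e in events if e.kind == "cpu"):
        followers = [
            e for e in events
            if e.kind in ("query", "disk_io")
            and e.cluster == cpu.cluster
            and cpu.time <= e.time <= cpu.time + WINDOW
        ]
        if {e.kind for e in followers} == {"query", "disk_io"}:
            groups.append([cpu, *followers])
    return groups


t0 = datetime(2016, 6, 1, 12, 0)
events = [
    Event("cpu", "db-a", t0),
    Event("query", "db-a", t0 + timedelta(minutes=2)),
    Event("disk_io", "db-a", t0 + timedelta(minutes=4)),
    Event("query", "db-b", t0 + timedelta(minutes=1)),    # different cluster
    Event("disk_io", "db-a", t0 + timedelta(minutes=9)),  # outside the window
]
groups = correlated_groups(events)
print(len(groups), [e.kind for e in groups[0]])
```

Only the three `db-a` events inside the five-minute window end up in the group. The event from another cluster and the late disk I/O event are left out.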
DEFINITION: tag(s) + time window + filter = matching events
EXAMPLE: cluster + 30 minutes + source_system=api.* AND cluster NOT IN [“stage-*”] = matching events
All alerts in an incident correlated by this rule share the same cluster, no more than 30 minutes separate the creation of the first alert from that of the most recent one, and every alert meets the filter conditions.
SAMPLE HEURISTIC
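A minimal sketch of how such a rule might be evaluated, assuming the patterns `api.*` and `stage-*` are shell-style globs (the slide does not say how they are matched). The `Alert` type, the function names, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Alert:
    source_system: str
    cluster: str
    created: datetime


WINDOW = timedelta(minutes=30)


def passes_filter(alert):
    """source_system=api.* AND cluster NOT IN ["stage-*"], read as globs."""
    return (fnmatchcase(alert.source_system, "api.*")
            and not fnmatchcase(alert.cluster, "stage-*"))


def correlate(alerts):
    """Group filtered alerts by cluster (the tag). A new incident starts when
    an alert arrives more than WINDOW after the incident's first alert."""
    incidents = []
    open_by_cluster = {}
    for alert in sorted(alerts, key=lambda a: a.created):
        if not passes_filter(alert):
            continue
        incident = open_by_cluster.get(alert.cluster)
        if incident is None or alert.created - incident[0].created > WINDOW:
            incident = []
            incidents.append(incident)
            open_by_cluster[alert.cluster] = incident
        incident.append(alert)
    return incidents


t0 = datetime(2016, 6, 1, 9, 0)
alerts = [
    Alert("api.orders", "prod-1", t0),
    Alert("api.users", "prod-1", t0 + timedelta(minutes=10)),
    Alert("api.orders", "prod-1", t0 + timedelta(minutes=45)),  # past the window
    Alert("api.orders", "stage-2", t0 + timedelta(minutes=5)),  # filtered out
    Alert("web.front", "prod-1", t0 + timedelta(minutes=5)),    # filtered out
]
incidents = correlate(alerts)
print([len(i) for i in incidents])
```

The first two `prod-1` alerts form one incident. The third falls more than 30 minutes after the first alert, so it opens a new incident, and the staging and non-API alerts never match.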
WHO CARES?
What should I work on next?
What’s about to break?
How does that impact the business?
PRIORITIZE
INVESTIGATE
PREVENT
“THE BEST WAY OUT IS ALWAYS THROUGH.”
-ROBERT FROST
DAN TURCHIN | BIGPANDA
[email protected] | @DTURCHIN | (650)533-0918