intelligent monitoring
DESCRIPTION
This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
TRANSCRIPT
Intelligent Monitoring
Denis A. Vieira Jr.
Ricardo Clemente
Intelligent Monitoring
Summary:
Motivation
Where are we?
Where are we going?
Action Plan
Event Correlation
Motivation:
Only point-in-time monitoring available
Decrease time to repair incidents
Proactive monitoring
Realistic view from live environment
Learn (identify patterns)
Automation
Store historical data with no loss
Improve credibility and Situational Awareness
Where are we?:
Lots of information (1200 servers with more than 14000 monitors)
– more than 40000 graphs being plotted
Lots of tools for monitoring running (SME, IPMonitor, Cricket,
SiteScope, SiteSeer, Logs)
Difficulties with specific customizations, performance and cost
Alarms have no credibility (lots of emails), though this is much better
than before.
Where are we going?:
Use of events. E.g.: Appenders for log frameworks to integrate
information from applications
Knowledge to anticipate undesired situations
Unified interface for monitoring
Root cause detection
Action Plan:
Unify the monitoring tools with Nagios (scalability and integration)
Integrate Nagios with correlation system using NEB (Nagios Event
Broker)
available at:
code.google.com/p/neb2activemq
Map events and systems to correlate
(manual and analytic task)
Event Correlation
Overview and system architecture
Event Bus
Correlation technique
Correlation engine
Visualization
Machine Learning
Project
Overview and system architecture
Modular and event-driven architecture
[Diagram: COLLECTOR, CORRELATION ENGINE, MACHINE LEARNING and VISUALIZATION modules attached to the EVENT BUS]
What is the system architecture?
Unique bus for message exchange
Modules are separate operating system processes and can run on
different machines
Modules can publish / subscribe to queues / topics on the bus
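The publish / subscribe model above can be illustrated with a minimal in-process sketch. This is not the API of any real message broker: the `EventBus` class and its method names are hypothetical, and the round-robin delivery on queues is an assumption. It only shows the difference between a topic (every subscriber receives each message) and a queue (each message goes to exactly one consumer).

```python
from collections import defaultdict
from itertools import cycle


class EventBus:
    """Hypothetical in-process bus; a real deployment would use a message broker."""

    def __init__(self):
        self._topic_subs = defaultdict(list)  # topic name -> all subscribers
        self._queue_subs = defaultdict(list)  # queue name -> competing consumers
        self._queue_next = {}                 # queue name -> round-robin iterator

    def subscribe(self, kind, name, callback):
        if kind == "topic":
            self._topic_subs[name].append(callback)
        elif kind == "queue":
            self._queue_subs[name].append(callback)
            # Rebuild the round-robin iterator whenever a consumer joins.
            self._queue_next[name] = cycle(list(self._queue_subs[name]))
        else:
            raise ValueError("kind must be 'queue' or 'topic'")

    def publish(self, kind, name, event):
        if kind == "topic":
            # Topic: every subscriber gets a copy of the event.
            for callback in self._topic_subs[name]:
                callback(event)
        elif kind == "queue":
            # Queue: exactly one consumer gets the event (assumed round-robin).
            if name in self._queue_next:
                next(self._queue_next[name])(event)
        else:
            raise ValueError("kind must be 'queue' or 'topic'")
```

With this sketch, a correlation module and a visualization module subscribed to the same topic would both see every event, while two collectors sharing a queue would split the work between them.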
Why an Event-Driven Architecture?
Loosely coupled and distributed
Less intrusive for monitored systems
Modules are independent
Overview and system architecture
Event bus
Open source project
Chose Apache ActiveMQ:
Stable
Performance
Active community
Connectivity
JMS
STOMP
REST
XMPP (...)
Event Bus
Message format
JSON (not XML)
Simplicity
Structure
Header: channel type (queue or topic) and event type
Body: data
$ curl -d 'type=queue&body={"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}&eventtype=CPU' http://barramento/message/events
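As a rough illustration of the message structure, the sketch below builds and parses the same kind of payload in Python. The field names `type`, `eventtype` and `body` come from the curl example; the function names `encode_event` and `decode_event` are hypothetical, and the sketch does not send anything to the bus.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_event(channel_type, event_type, data):
    """Build a form-encoded payload: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel_type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,       # header: channel type
        "eventtype": event_type,    # header: event type
        "body": json.dumps(data),   # body: event data as JSON
    })


def decode_event(payload):
    """Parse a payload back into (channel type, event type, data)."""
    fields = parse_qs(payload)
    return (fields["type"][0],
            fields["eventtype"][0],
            json.loads(fields["body"][0]))


payload = encode_event(
    "queue", "CPU", {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"}
)
```

Keeping the header outside the JSON body lets a receiver route on channel and event type without parsing the data itself.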
Correlation Technique
CEP (Complex Event Processing)
Technology that enables processing multiple events in real time with
the goal of identifying meaningful events
Based on rules or queries (“SQL like”)
Queries created at run time
History
In 1995, Professor David Luckham from Stanford, working on the Rapide
project, coined the term CEP
Database research topic: Data Stream Management Systems (DSMS)
Correlation technique
[Diagram: in a database, a query is processed against persistent relations (stored data) and returns an answer; in a data stream system, continuous data flows through query processing in memory, which returns the answer]
Data stream: an “upside down database”
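One way to picture the “upside down database” idea: the query is registered first and stays in memory, and each arriving event updates its answer. The sketch below is a hypothetical standing query (the class name `ContinuousAverage` is an assumption, not part of any CEP engine) that keeps an average over a sliding time window, assuming events arrive in timestamp order.

```python
from collections import deque


class ContinuousAverage:
    """A standing query: average of `field` over the last `window` seconds."""

    def __init__(self, field, window):
        if window <= 0:
            raise ValueError("window must be positive")
        self.field = field
        self.window = window
        self._events = deque()  # (timestamp, value) pairs still inside the window
        self._total = 0.0

    def on_event(self, timestamp, event):
        """Feed one event and return the updated answer."""
        value = event[self.field]
        self._events.append((timestamp, value))
        self._total += value
        # Expire events that fell out of the time window.
        while self._events[0][0] <= timestamp - self.window:
            _, old = self._events.popleft()
            self._total -= old
        return self._total / len(self._events)
```

Only the events inside the window are kept, so memory use depends on the window size and event rate rather than on the total history.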
Correlation Technique
Marketing
Trend (buzz)
CEP market is estimated at 460 million dollars by 2010 (source: IEEE
Computer Society – April 2009)
Useful where there are data streams and a need to extract
information from that data in real time
Financial market
Logistics processes (RFID)
Airport control
ICUs
Datacenters
Correlation Technique
Big Players
Correlation Technique
Open Source Players
Academic projects:
STREAM – Stanford – 2003 (officially deprecated)
TelegraphCQ – Berkeley - 2003
Based on PostgreSQL 7.3.2
No activity
Cayuga – Cornell
From the industry:
Esper, a Codehaus project, complete in terms of features
Compact and flexible syntax
Excellent documentation
Performance
Our choice!
Correlation Engine
If sessions rose 10% in the last 3 min, and the average
server CPU did not rise 5%, and MySQL slow queries
are above 10, then there is database retention causing
users to queue
Application
Correlation Engine
Application
[Diagram: timelines for the Server (cpu_usr) and Vip (session) streams from t – 3 min to t, and for the Mysql (slow_query) stream at t]
SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session,
       Vip_PAST.session, Mysql.slow_query
FROM
  Server.win:time(1 min) as Server,
  Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST,
  Vip.win:time(1 min) as Vip,
  Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST,
  Mysql.win:time(1 min) as Mysql
HAVING
  Vip.session > Vip_PAST.session * 1.10 AND
  avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND
  Mysql.slow_query > 10
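The three conditions of the rule can also be read as plain arithmetic. The sketch below is only an illustration in Python, not how Esper evaluates the query: the function name and argument layout are hypothetical, and the windowing is assumed to have been done already so that the function receives current values and values from 3 minutes earlier.

```python
def database_retention_suspected(sessions_now, sessions_3min_ago,
                                 cpu_now, cpu_3min_ago, slow_queries):
    """Return True when the three conditions of the rule hold together.

    cpu_now and cpu_3min_ago are lists of cpu_usr readings, one per server.
    """
    avg_cpu_now = sum(cpu_now) / len(cpu_now)
    avg_cpu_past = sum(cpu_3min_ago) / len(cpu_3min_ago)

    sessions_rose = sessions_now > sessions_3min_ago * 1.10  # sessions up 10%
    cpu_flat = avg_cpu_now < avg_cpu_past * 1.05             # CPU did not rise 5%
    db_slow = slow_queries > 10                              # slow queries above 10

    return sessions_rose and cpu_flat and db_slow
```

The point of the rule is the combination: more users waiting without more CPU work on the servers, plus slow queries, points at the database rather than at the application servers.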
Correlation Engine
Application
Identifying an outlier
select host, free, avg(free)
from Memory.win:time(240 sec) group by host
having free < avg(free)
Event sequence
select * from
pattern [every Memory(free < 10) ->
(timer:interval(60 sec) and Log(text like '%OutOfMemory%'))]
Schedule and extensions
select idle from pattern [every timer:at(*, [16:22], *, [0,3], *)].win:time(30 sec),
CPU.win:time(30) where idle < 30 AND Filter.isInNode(id, "Sports.BigFarm")
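To make the outlier query concrete, the sketch below mirrors its logic in Python: per host, keep the `free` readings of the last 240 seconds and flag a reading that is below that host's windowed average. It is an illustration only, not Esper code; the class name `MemoryOutlierDetector` is hypothetical and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque


class MemoryOutlierDetector:
    """Flags a host when its latest `free` reading is below its own windowed average."""

    def __init__(self, window_sec=240):
        if window_sec <= 0:
            raise ValueError("window_sec must be positive")
        self.window_sec = window_sec
        self._readings = defaultdict(deque)  # host -> deque of (timestamp, free)

    def on_event(self, timestamp, host, free):
        readings = self._readings[host]
        readings.append((timestamp, free))
        # Drop readings older than the time window (the newest one always stays).
        while readings[0][0] <= timestamp - self.window_sec:
            readings.popleft()
        average = sum(value for _, value in readings) / len(readings)
        return free < average
```

Grouping by host matters here: a host is compared only with its own recent history, so a small machine is not flagged just because larger machines have more free memory.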
Correlation Engine
Esper performance
Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
Item | Specification
Esper server HW | 2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM
VM config | -Xms2g -Xmx2g -Xns128m -Xgc:gencon
Query | # queries | events/s | Latency | Average latency | Note
select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds | 1000 | 519 728 | 99.66% < 10us | 2.8us | CPU at 85%, 70 Mbit/s
Correlation engine
Processes inside the correlation engine
Visualization – Console
Querying the live environment
Visualization – Troubleshooting
Anticipating and solving incidents more quickly
Visualization – Dashboard
Consolidated view of the environment
What about unseen problems?
Machine Learning
Chose unsupervised, incremental algorithms
Incremental PCA
Transforms a number of possibly correlated variables into a smaller
number of uncorrelated variables, the principal components
A change in the principal components means a broken correlation, or
anomaly
Can be used for data compression
Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006)
Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf
Implementation had two main challenges: measurements with missing values
and different scales
60 input signals
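The idea that a broken correlation shows up as an anomaly can be sketched with a very small incremental tracker. This is a simplified, single-component update in the style of Oja's rule with a forgetting factor; it is not necessarily the algorithm from the cited paper or the one used in the project. The class name `PrincipalComponentTracker` and the parameter values are assumptions, and the inputs are assumed to be already scaled and free of missing values.

```python
import math


class PrincipalComponentTracker:
    """Tracks one principal direction of a stream, one sample at a time."""

    def __init__(self, dim, forgetting=0.96):
        self.w = [1.0 / math.sqrt(dim)] * dim  # unit-length starting direction
        self.energy = 1e-3                     # running energy of the projection
        self.forgetting = forgetting           # below 1.0 so old samples fade out

    def update(self, x):
        """Consume one sample; return its residual norm before the update."""
        # Project the sample onto the current direction.
        y = sum(wi * xi for wi, xi in zip(self.w, x))
        self.energy = self.forgetting * self.energy + y * y
        # The residual is the part of x the direction cannot explain.
        err = [xi - y * wi for xi, wi in zip(x, self.w)]
        # Move the direction toward the sample, then renormalize.
        step = y / self.energy
        self.w = [wi + step * ei for wi, ei in zip(self.w, err)]
        norm = math.sqrt(sum(wi * wi for wi in self.w))
        self.w = [wi / norm for wi in self.w]
        return math.sqrt(sum(ei * ei for ei in err))
```

While the input signals keep moving together, the residual stays small; a sample that breaks the learned correlation produces a large residual, which is the kind of signal that could be reported as an anomaly.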
Machine Learning
Summarized in 1 principal component + generation matrix
Machine Learning
Second principal component
Sensitivity
Three anomalies
Machine Learning
Project
Status
All functionality developed
Algorithms being validated through tests with
RRDs and meetings with the operations team
Performance tests ongoing
System in the live environment with reduced scope
Project at Globo.com – Next challenges
Scale
Event “sharding”
Rule balancing
Cache
Optimize algorithm
Adaptive control of memory and sensitivity parameters
Add a supervised layer
Other algorithms to cooperate
Final considerations
References
http://delicious.com/fisl10
Questions
Contacts
Denis A. Vieira Jr
(www.globo.com)
Ricardo Clemente
(www.intelie.com.br)
Globo.com stand
This afternoon
Raise your hand!