elk possibilities: log management and beyond
ELK Possibilities
- Log Management And Beyond
Ashish Billore
ELK Possible Use Cases
Log Management:
  - Capture logs from all the services on all the nodes
  - Present them in a centralized dashboard for easy query / visualization
Serviceability, Troubleshooting and Debug Tool:
  - Use ELK to query and report logs filtered on: time range, specific node, any keyword
Regulatory and Audit Requirements:
  - Retain logs safely with Elasticsearch database snapshots
Monitoring:
  - Using the CollectD agent, monitor OS nodes: CPU, Mem, Disk, I/O, OpenIPMI*
  - Present metrics in a centralized dashboard for easy query / visualization
Alerting based on events:
  - Using the ELK ElastAlert plugin, alert on resource metric thresholds. Of interest are: CPU, Mem, Disk, I/O and IPMI-based thresholds
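The troubleshooting use case above (filter on time range, specific node, any keyword) can be sketched as an Elasticsearch query body. This is only an illustration: the field names `@timestamp`, `host` and `message` are assumptions based on common Logstash output, the helper `build_log_query` is hypothetical, and the `term` filter assumes the `host` field is indexed as an exact (not analyzed) value.

```python
import json


def build_log_query(start, end, node=None, keyword=None):
    """Build a bool query body: time range, optional node, optional keyword.

    Field names are assumptions; adjust them to the actual index mapping.
    """
    # Time range is always applied as a non-scoring filter.
    filters = [{"range": {"@timestamp": {"gte": start, "lte": end}}}]
    if node is not None:
        # Restrict to logs shipped from one node (assumed field: "host").
        filters.append({"term": {"host": node}})
    must = []
    if keyword is not None:
        # Full-text match on the log line (assumed field: "message").
        must.append({"match": {"message": keyword}})
    return {"query": {"bool": {"filter": filters, "must": must}}}


if __name__ == "__main__":
    body = build_log_query("now-1h", "now", node="compute-01", keyword="error")
    print(json.dumps(body, indent=2))
```

The resulting JSON could be sent to the `_search` endpoint of an index; the node name `compute-01` is a placeholder.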
Components and Config Options
ELK:
  - Components needed and purpose:
    o E, L, K
    o Curator: Elasticsearch index management plugin, used for purging older indexes
    o Redis: in front of Logstash to optimize log ingestion during heavy traffic or a Logstash outage
    o Elasticsearch to be configured with a replication factor of 1; this will replicate Elasticsearch documents on both nodes
    o Logstash needs to point to the Elasticsearch cluster for indexed output
Potential Issues for Investigation:
  - Elasticsearch in cluster mode has master and data nodes. Master election requires a minimum of 3 master-eligible nodes to avoid the split-brain problem
  - Logstash is CPU intensive: need to allocate multiple cores / threads for larger datasets or too many indexes
  - Elasticsearch is RAM intensive: deployments with large data need proportional RAM
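The 3-node minimum for master election follows from majority (quorum) arithmetic. A minimal sketch, assuming the usual rule quorum = N / 2 + 1 (integer division) as used for the `discovery.zen.minimum_master_nodes` setting in Elasticsearch versions that rely on Zen discovery; the function names are illustrative only.

```python
def quorum(master_eligible_nodes):
    """Smallest majority of master-eligible nodes."""
    if master_eligible_nodes < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible_nodes // 2 + 1


def tolerated_failures(master_eligible_nodes):
    """How many master-eligible nodes can be lost while a majority remains."""
    return master_eligible_nodes - quorum(master_eligible_nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, "nodes: quorum", quorum(n), "tolerates", tolerated_failures(n))
```

With 2 nodes the quorum is 2, so losing either node (or the link between them) leaves no majority; 3 is the smallest cluster that survives one failure without risking two independent masters.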
ELK + CollectD: Practical Considerations
Practical investigation tasks on ELK + CollectD to determine:
  - How much overhead?
    o CPU, memory, disk, network usage by each server component (E, L, K)?
    o Data increments (how much the disk usage grows over time)?
    o How much resource is consumed by agents on nodes: CollectD, rsyslog?
  - Non-functional requirements: scalability, HA, recovery, etc.
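The "data increments" question above can be answered roughly from a few disk-usage samples taken over time. A minimal sketch, assuming linear growth between the first and last sample; the sample format (day number, bytes used) and both function names are hypothetical.

```python
def average_daily_growth(samples):
    """Average bytes/day between first and last (day, used_bytes) sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day <= first_day:
        raise ValueError("samples must be sorted by increasing day")
    return (last_used - first_used) / (last_day - first_day)


def days_until_full(samples, capacity_bytes):
    """Days until the disk fills at the average rate, or None if not growing."""
    growth = average_daily_growth(samples)
    if growth <= 0:
        return None
    remaining = max(capacity_bytes - samples[-1][1], 0)
    return remaining / growth


if __name__ == "__main__":
    # Made-up numbers purely to exercise the functions.
    usage = [(0, 100), (5, 160), (10, 200)]
    print(average_daily_growth(usage), days_until_full(usage, 300))
```

Real measurements would come from the nodes under test (for example from CollectD disk metrics); the numbers here are placeholders, not results.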
Reason for Optimism:
  - ELK combination heavily used for distributed, scale-out applications: OpenStack, clouds, in the community
  - Production deployments at: Facebook, Uber, Netflix, eBay, Stack Overflow, Verizon, OpenStack Gerrit, NASA ..
  - More info: https://www.elastic.co/use-cases
Add-on Functionalities
1. Metric collection / reporting with CollectD:
  - Deploy single-node ELK with the CollectD plugin and capture the following metric data:
    o CPU, Memory, Disk, I/O and IPMI
2. ElastAlert plugin for alerts:
  - On the above setup, configure the ElastAlert plugin
  - Set up alerts for certain metrics: CPU, disk space, network I/O or errors in logs
  - Configure notification (e.g. through email / dashboard alerts)
3. Curator plugin for Elasticsearch index clean-up:
  - Configure the above setup with the Curator plugin
  - Create various policies for index cleanup:
    o Based on time range: index older than 1 week
    o Based on condition: if disk is 80% full
4. Snapshot and restore for data backup / restore:
  - On the above setup, take a snapshot of the Elasticsearch database
  - Restore an empty Elasticsearch database from the above snapshot
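The two cleanup policies in item 3 (older than 1 week, disk 80% full) can be sketched as plain selection logic. This is not Curator's API, only an illustration of the policy; it assumes daily indexes named `logstash-YYYY.MM.DD`, and the function names and the "drop the oldest remaining index under disk pressure" rule are assumptions.

```python
from datetime import date, datetime, timedelta


def index_date(name, prefix="logstash-"):
    """Parse the day from an index name like logstash-2016.01.09, else None."""
    if not name.startswith(prefix):
        return None
    try:
        return datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
    except ValueError:
        return None


def indices_to_delete(names, today, max_age_days=7,
                      disk_used_fraction=0.0, disk_limit=0.8):
    """Select indexes older than max_age_days; under disk pressure, also
    select the oldest index that would otherwise be kept."""
    dated = []
    for name in names:
        day = index_date(name)
        if day is not None:  # skip non-daily indexes such as .kibana
            dated.append((day, name))
    dated.sort()
    cutoff = today - timedelta(days=max_age_days)
    selected = [name for day, name in dated if day < cutoff]
    if disk_used_fraction >= disk_limit:
        kept = [name for day, name in dated if day >= cutoff]
        if kept:
            selected.append(kept[0])
    return selected


if __name__ == "__main__":
    names = ["logstash-2016.01.01", "logstash-2016.01.09",
             "logstash-2016.01.10", ".kibana"]
    print(indices_to_delete(names, date(2016, 1, 10)))
    print(indices_to_delete(names, date(2016, 1, 10), disk_used_fraction=0.85))
```

Taking a snapshot (item 4) before deleting anything keeps the purged data restorable.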
Experiment Links for reference
ELK Data Management:
  - https://www.elastic.co/guide/en/elasticsearch/guide/current/retiring-data.html#archive-indices
  - https://www.elastic.co/guide/en/elasticsearch/guide/current/retiring-data.html#retiring-data
  - https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-snapshots.html
ELK Cluster, Scaling and Failover:
  - https://www.elastic.co/guide/en/logstash/current/deploying-and-scaling.html
  - https://www.elastic.co/guide/en/elasticsearch/guide/current/_add_failover.html
  - https://www.elastic.co/guide/en/elasticsearch/guide/current/backing-up-your-cluster.html
ELK Curator for roll-over, cleanup:
  - https://www.elastic.co/guide/en/elasticsearch/client/curator/3.5/getting-started.html
ELK Alerts (Elastalert):
  - https://github.com/Yelp/elastalert
ELK resource monitoring with CollectD:
  - https://mtalavera.wordpress.com/2015/02/16/monitoring-with-collectd-and-kibana/
  - https://collectd.org/wiki/index.php/Table_of_Plugins
Quick Recap
Log Integration Framework (Apache 2.0 License):
  - Log collection
  - Centralization
  - Parsing
  - Storage and search
  - Visualization: searchable time-series dashboards
Scale Log Management as Systems/Cloud Scale
  - Horizontally scale each component as needed
  - DevOps friendly with Chef, Ansible and Puppet scripts
ELK : Generic Log Management and Beyond
• ELK being a generic framework:
  - Leverage for any system that can generate logs or data over time: OpenStack, syslog / rsyslog, DB logs, any other application logs, metrics
  - Easy to on-board any application / service in future
• Dashboards:
  - Time-series data visualization
  - Multiple views for multiple information for multiple users (can be authorization-secured)
  - Embeddable in other widgets / pages / views
• Elasticsearch:
  - “google-search” for entire system logs
  - Search available over REST
  - JSON-based indexed DB
• ELK combination heavily used: for distributed, scale-out applications, OpenStack, clouds, in the community
ELK + CollectD for Utilization Data Visualization, trending
• CollectD
  - Daemon to collect system performance statistics periodically
  - Provides mechanisms to store values in a variety of ways
  - Open source, actively developed, with 90+ plugins to use out of the box
Sample Results: