Transcript
Page 1: Scaling ELK Stack - DevOpsDays Singapore

ELK - Log Processing at Scale

#DevOpsDays 2015, Singapore - @DevOpsDaysSG

Angad Singh

Page 2:

About me

DevOps at Viki, Inc. - a global video streaming site with subtitles.

Previously a Twitter SRE, National University of Singapore

Twitter: @angadsg

GitHub: @angad

Page 3:

Elasticsearch - Log Indexing and Searching

Logstash - Log Ingestion plumbing

Kibana - Frontend

Page 4:

Metrics vs Logging

Metrics

● Numeric timeseries data

● Actionable

● Counts, Statistical (p90, p99 etc.)

● Scalable, cost-effective solutions already available

Page 5:

Logging

● Useful for debugging

● Catch-all

● Full text searching

● Computationally intensive, harder to scale


Page 7:

Logs

● Application logs - stack traces, handled exceptions

● Access Logs - Status codes, URI, HTTP Method at all levels of the stack

● Client logs - direct HTTP requests containing log events from client-side JavaScript or mobile applications (Android/iOS)

● Standardized log format to JSON - easy to add / remove fields.

● Request tracing across services using a unique ID assigned at the load balancer
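A standardized JSON log line along the lines described above might look like this (field names and values are illustrative, not taken from the talk):

```json
{
  "@timestamp": "2015-10-01T08:30:00Z",
  "level": "error",
  "service": "api",
  "request_id": "a1b2c3d4",
  "status": 500,
  "message": "upstream timeout"
}
```

Keeping the format flat JSON makes adding or removing a field a one-line change, and the shared request_id lets the same request be traced through every service that logs it.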

Page 8:

Logstash

● Log aggregator
● Log preprocessing (filtering etc.)
● 3-stage pipeline
● Input > Filter > Output

Page 9:

Elasticsearch

● Full-text searching and indexing
● Built on top of Apache Lucene
● RESTful web interface
● Horizontally scalable

Page 10:

Kibana

● Frontend
● Visualizations, dashboards
● Supports geo visualizations
● Uses the ES REST API
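Since Kibana talks to the same REST API anyone can use, its visualizations boil down to ordinary search requests; a hand-written equivalent of a simple one (index and field names are illustrative) could be POSTed to /logstash-2015.10.01/_search:

```json
{
  "query": {
    "match": { "message": "timeout" }
  },
  "aggs": {
    "per_status": { "terms": { "field": "status" } }
  }
}
```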

Page 11:
Page 12:

Logstash

Input - any stream
● local file
● queue
● tcp, udp
● twitter
● etc.

Filter - mutation
● add/remove field
● parse as JSON
● ruby code
● parse GeoIP
● etc.

Output
● elasticsearch
● redis
● queue
● file
● pagerduty
● etc.
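The three stages map directly onto a Logstash config file. A minimal sketch of such a pipeline (hostnames, ports, and certificate paths are illustrative, not from the talk):

```
input {
  lumberjack {
    port            => 5000
    ssl_certificate => "/etc/logstash/server.crt"
    ssl_key         => "/etc/logstash/server.key"
  }
}

filter {
  # parse the raw line as JSON, then enrich with GeoIP data
  json  { source => "message" }
  geoip { source => "clientip" }
}

output {
  elasticsearch { host => "es.internal" }
}
```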

Page 13:

Logstash Forwarder

● Go program that sits next to log files; speaks the lumberjack protocol

● Forwards logs from a file to a logstash server

● Removes the need for a buffer (such as redis or a queue) for logs pending ingestion into logstash

● Runs as a Docker container with /var/log volume-mounted; configuration stored in Consul

● Application containers volume-mount /var/log to /var/log/docker/<container>/application.log
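A logstash-forwarder config for this layout might look roughly like the following (server name, certificate path, and glob pattern are illustrative):

```json
{
  "network": {
    "servers": [ "haproxy.internal:5000" ],
    "ssl ca": "/etc/pki/logstash-forwarder.crt",
    "timeout": 15
  },
  "files": [
    {
      "paths": [ "/var/log/docker/*/application.log" ],
      "fields": { "type": "application" }
    }
  ]
}
```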

Page 14:

Logstash Pool with HAProxy

4 x logstash machines: 8 cores, 16 GB RAM.

7 x logstash processes per machine: 5 for application logs, 2 for HTTP client logs.

Fronted by HAProxy for both the lumberjack protocol and HTTP.

Easily scalable by adding more machines and spinning up more logstash processes.
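Since lumberjack is just TLS over TCP, the HAProxy side can balance it in plain TCP mode. A sketch (names, ports, and addresses are illustrative):

```
# TCP passthrough for the lumberjack protocol
frontend lumberjack_in
    bind *:5000
    mode tcp
    default_backend logstash_pool

backend logstash_pool
    mode tcp
    balance roundrobin
    server logstash1 10.0.0.11:5000 check
    server logstash2 10.0.0.12:5000 check
```

Adding capacity is then a matter of adding another `server` line per new logstash process.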

Page 16:

Elasticsearch Hardware

12 cores, 64 GB RAM, RAID 0 over 2 x 3 TB 7200 rpm disks.

20 nodes, 20 shards, 3 replicas (plus 1 primary).

~300 GB per day x 4 copies (3 replicas + 1 primary) ≈ 3 months of data on 120 TB.

Average 6k-8k logs per second, peaks of 25k logs per second.

https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
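The capacity math on this slide can be sanity-checked with a quick back-of-the-envelope script (all numbers taken from the slide):

```python
# Back-of-the-envelope capacity check for the cluster sizing above.
daily_gb = 300          # raw logs indexed per day (primary data)
copies = 4              # 1 primary + 3 replicas
retention_days = 90     # ~3 months

nodes = 20
disk_per_node_tb = 6    # RAID 0 over 2 x 3 TB disks

needed_tb = daily_gb * copies * retention_days / 1000
capacity_tb = nodes * disk_per_node_tb

print(f"needed: {needed_tb:.0f} TB of {capacity_tb} TB total")
# → needed: 108 TB of 120 TB total
```

So 90 days of 4-copy data fits, with roughly 10% headroom, on the 120 TB of raw disk across the 20 nodes.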

Page 17:

Elasticsearch Hardware

Page 18:

Hardware Tuning

● Keep the heap below 30.5 GB - the JVM uses compressed pointers only below ~30.5 GB.

● Sweet spot: 64 GB of RAM, with half left available for Lucene file buffers.

● SSD or RAID 0 (or multiple data-path directories, similar to RAID 0).

● If SSD, set the I/O scheduler to deadline instead of cfq.

● RAID 0: no need to worry about disk failures, since machines can easily be replaced thanks to the multiple copies of the data.

● Disable swap.
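The heap and swap points above translate into a couple of settings (a sketch; exact file locations depend on how Elasticsearch is packaged):

```
# /etc/default/elasticsearch - keep the heap under the ~30.5 GB
# compressed-pointer threshold, leaving the rest for Lucene file buffers
ES_HEAP_SIZE=30g

# elasticsearch.yml - lock the heap in memory so it is never swapped out
bootstrap.mlockall: true
```

Disabling swap entirely (`swapoff -a`, plus removing swap entries from /etc/fstab) is the belt-and-braces complement to mlockall.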

Page 19:

Elasticsearch Configuration

● 20 days of indexes kept open, based on available memory; the rest closed, opened on demand.

● Field data - cache used while sorting and aggregating data.

● Circuit breaker - cancels requests that would require too much memory, preventing OOM; hit http://elasticsearch:9200/_cache/clear if field data is very close to the memory limit.

● Shards >= number of nodes.

● Lucene forceMerge - minor performance improvements for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html)

Page 20:

Prevent a split-brain situation (and the data loss it can cause) by setting the minimum number of master-eligible nodes to (n/2 + 1)
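For this 20-node cluster that quorum works out to 20/2 + 1 = 11, set in elasticsearch.yml:

```
# elasticsearch.yml - with 20 master-eligible nodes, quorum = 20/2 + 1 = 11
discovery.zen.minimum_master_nodes: 11
```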

Set higher ulimit for elasticsearch process

A daily cronjob deletes data older than 90 days, closes indices older than 20 days, and optimizes (forceMerge) indices older than 2 days
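The decision logic behind that cronjob can be sketched in a few lines of Python, assuming the common logstash-YYYY.MM.DD index naming scheme (the talk does not show its actual script; Curator is the usual off-the-shelf tool for this):

```python
from datetime import date

def index_action(name, today, close_after=20, delete_after=90):
    """Decide what the daily cronjob should do with a time-based index.

    Assumes logstash-YYYY.MM.DD index names; thresholds mirror the slide
    (close after 20 days, delete after 90).
    """
    idx_date = date(*map(int, name[len("logstash-"):].split(".")))
    age_days = (today - idx_date).days
    if age_days > delete_after:
        return "delete"
    if age_days > close_after:
        return "close"
    return "keep open"

today = date(2015, 10, 1)
print(index_action("logstash-2015.09.25", today))  # → keep open
print(index_action("logstash-2015.08.01", today))  # → close
print(index_action("logstash-2015.06.01", today))  # → delete
```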

And also...

Page 21:
Page 22:

Monitoring

Marvel - official monitoring plugin from Elasticsearch

KOPF - index management plugin

CAT APIs - REST APIs to view cluster information

Curator - data management

Page 23:

Thanks!

email: [email protected]

twitter: @angadsg
