scaling elk stack - devopsdays singapore

Post on 17-Feb-2017

  • ELK: Log processing at Scale

    #DevOpsDays 2015, Singapore (@DevOpsDaysSG)

    Angad Singh

  • About me

    DevOps at Viki, Inc - a global video streaming site with subtitles.

    Previously a Twitter SRE, National University of Singapore

    Twitter @angadsg, GitHub @angad

  • Elasticsearch - Log Indexing and Searching

    Logstash - Log Ingestion plumbing

    Kibana - Frontend

  • Metrics vs Logging

    Metrics

    Numeric timeseries data

    Actionable

    Counts, Statistical (p90, p99 etc.)

    Scalable cost-effective solutions already available

  • Logging

    Useful for debugging

    Catch-all

    Full text searching

    Computationally intensive, harder to scale

  • Alerting and Monitoring at Viki

    Deeper level debugging with application logs

    Success Rate Alert for service X

  • Logs

    Application logs - Stack traces, handled exceptions

    Access logs - Status codes, URI, HTTP method at all levels of the stack

    Client logs - Direct HTTP requests containing log events from client-side JavaScript or mobile applications (Android/iOS)

    Log format standardized to JSON - easy to add / remove fields.

    Request tracing through various services using Unique-ID at Load Balancer
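A minimal sketch of what a standardized JSON log line with a request ID might look like. The field names (`service`, `level`, `request_id` and so on) are hypothetical; the slide only says that logs are JSON and that a unique ID is assigned at the load balancer.

```python
import json
import uuid
from datetime import datetime, timezone


def make_log_event(service, level, message, request_id=None, **fields):
    """Build one log event as a single JSON line.

    All field names are illustrative assumptions, not the format used at Viki.
    """
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "level": level,
        "message": message,
        # The ID would normally arrive from the load balancer (for example in
        # a request header); a fresh one is generated here only as a fallback.
        "request_id": request_id or uuid.uuid4().hex,
    }
    event.update(fields)  # extra fields are easy to add or remove
    return json.dumps(event, sort_keys=True)


line = make_log_event("api", "error", "upstream timeout",
                      request_id="abc123", status=504, uri="/v1/videos")
print(line)
```

Because every service logs the same `request_id`, a single search on that field can return the full path of one request through the stack.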

  • Logstash - Log aggregator, log preprocessing (filtering etc.), 3 stage pipeline: Input > Filter > Output

    Elasticsearch - Full text searching and indexing on top of Apache Lucene, RESTful web interface, horizontally scalable

    Kibana - Frontend, visualizations, dashboards, supports geo visualizations, uses ES REST API

  • Logstash

    Input - Any stream: local file, queue, tcp, udp, twitter etc.

    Filter - Mutation: add/remove field, parse as json, ruby code, parse geoip etc.

    Output - elasticsearch, redis, queue, file, pagerduty etc.
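The Input > Filter > Output flow can be sketched in a few lines of Python. This is only an illustration of the three-stage idea, not Logstash's configuration language or its implementation; the function names and the sample event are made up for the example.

```python
import json


def parse_json(event):
    """Filter: parse the raw line as JSON and merge its keys into the event."""
    event.update(json.loads(event.pop("raw")))
    return event


def add_field(name, value):
    """Filter factory: add a constant field to every event."""
    def _filter(event):
        event[name] = value
        return event
    return _filter


def remove_field(name):
    """Filter factory: drop a field if it is present."""
    def _filter(event):
        event.pop(name, None)
        return event
    return _filter


def run_pipeline(lines, filters, output):
    for line in lines:          # input stage: any iterable of raw lines
        event = {"raw": line}
        for f in filters:       # filter stage: applied in order
            event = f(event)
        output(event)           # output stage: any callable sink


collected = []
run_pipeline(
    ['{"status": 200, "uri": "/v1/videos", "debug": "x"}'],
    [parse_json, add_field("env", "production"), remove_field("debug")],
    collected.append,
)
print(collected)
```

Swapping `collected.append` for a function that writes to a file or a queue changes the output stage without touching the filters, which is the main appeal of the staged design.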

  • Logstash Forwarder

    Golang program that sits next to log files and speaks the lumberjack protocol.

    Forwards logs from a file to a logstash server.

    Removes the need for a buffer (such as redis, or a queue) for logs pending ingestion to logstash.

    Runs as a Docker container with /var/log volume mounted.

    Configuration stored in Consul.

    Application containers with volume mounted /var/log to /var/log/docker//application.log

  • Logstash pool with HAProxy

    4 x logstash machines, 8 cores, 16 GB RAM

    7 x logstash processes per machine, 5 for application logs, 2 for HTTP client logs.

    Fronted by HAProxy for both lumberjack protocol as well as HTTP protocol.

    Easily scalable by adding more machines and spinning up more logstash processes.

  • Application Service Container 1

    Application Service Container 2

    Logstash-Forwarder Container

    Mounted /var/log to /var/log/docker/ on host

  • Elasticsearch Hardware

    12 cores, 64GB RAM with RAID 0 - 2 x 3TB 7200rpm disks.

    20 nodes, 20 shards, 3 replicas (plus 1 primary).

    Each day ~300GB x 4 copies (3 + 1), so ~3 months of data on 120TB.

    Average 6k-8k logs per second, peak 25k logs per second.

    https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html
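The retention figure on this slide follows from simple arithmetic, worked through below. It assumes decimal units (1 TB = 1000 GB) and ignores filesystem overhead and free-space headroom, so the real number would be somewhat lower.

```python
NODES = 20
DISKS_PER_NODE = 2
DISK_GB = 3000            # 3 TB disks, striped with RAID 0
DAILY_PRIMARY_GB = 300    # approximate daily log volume from the slide
COPIES = 4                # 1 primary + 3 replicas

raw_capacity_gb = NODES * DISKS_PER_NODE * DISK_GB   # 120 TB across the cluster
daily_gb = DAILY_PRIMARY_GB * COPIES                 # 1.2 TB written per day
retention_days = raw_capacity_gb // daily_gb         # whole days that fit

print(raw_capacity_gb, daily_gb, retention_days)
```

About 100 days of capacity lines up with the "~3 months" on the slide and with the 90-day deletion policy mentioned later in the deck.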

  • Hardware Tuning

    < 30.5 GB heap - Java uses compressed pointers only below a ~30.5GB heap.

    Sweet spot - 64GB of RAM with half available for Lucene file buffers.

    SSD or RAID 0 (or multiple path directories similar to RAID 0).

    If SSD then set I/O scheduler to deadline instead of cfq.

    RAID 0 - no need to worry about disks failing as machines can easily be replaced due to multiple copies of data.

    Disable swap.
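The heap guidance above amounts to "half of RAM, but never past the compressed-pointer threshold". A small sketch of that rule follows; the 30.5 GB constant is the figure quoted on the slide, and the helper name is hypothetical. In practice one would pick a value slightly under the cap rather than exactly on it.

```python
COMPRESSED_POINTER_LIMIT_GB = 30.5  # threshold quoted on the slide


def recommended_heap_gb(ram_gb):
    """Half of RAM for the JVM heap, capped at the compressed-pointer limit."""
    return min(ram_gb / 2, COMPRESSED_POINTER_LIMIT_GB)


for ram in (16, 64, 128):
    heap = recommended_heap_gb(ram)
    print(f"{ram} GB RAM: {heap} GB heap, {ram - heap} GB left for the OS file cache")
```

On the 64 GB machines described earlier, the cap applies and roughly half of the memory stays available to the operating system for Lucene's file buffers.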

  • Elasticsearch Configuration

    20 days of indexes open based on available memory, rest closed - open on demand.

    Field data - cache used while sorting and aggregating data.

    Circuit breaker - cancels requests which require large memory, to prevent OOM.

    http://elasticsearch:9200/_cache/clear if field data is very close to memory limit.

    Shards >= Number of nodes.

    Lucene forceMerge - minor performance improvements for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html)
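As a hedged illustration of the cache-clear call, the snippet below builds the HTTP request for the endpoint shown on the slide without sending it. It assumes the endpoint is invoked with POST, as the Elasticsearch clear-cache documentation describes, and that the hostname `elasticsearch` resolves in your environment; the helper name is hypothetical.

```python
import urllib.request


def build_clear_cache_request(base_url="http://elasticsearch:9200"):
    """Return (but do not send) a POST request for the clear-cache endpoint."""
    return urllib.request.Request(f"{base_url}/_cache/clear", method="POST")


req = build_clear_cache_request()
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req, timeout=10)
```

Clearing caches forces field data to be reloaded on the next sort or aggregation, so it is a relief valve for memory pressure rather than something to run routinely.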

  • And also...

    Prevent split brain situation to avoid losing data - set minimum number of master eligible nodes to (n/2 + 1), where n is the number of master eligible nodes.

    Set higher ulimit for elasticsearch process.

    Daily cronjob which deletes data older than 90 days, closes indices older than 20 days, optimizes (forceMerge) indices older than 2 days.
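The daily cronjob's policy can be sketched as a function that maps an index to an action by age. This is not the actual job (the deck lists Curator for data management); it assumes daily indices named `logstash-YYYY.MM.DD`, which is Logstash's default pattern but is not stated on the slides, and it only plans the action rather than calling Elasticsearch.

```python
from datetime import date, datetime

DELETE_AFTER_DAYS = 90   # thresholds taken from the slide
CLOSE_AFTER_DAYS = 20
MERGE_AFTER_DAYS = 2


def plan_action(index_name, today, prefix="logstash-"):
    """Return 'delete', 'close', 'forcemerge' or 'keep' for a daily index."""
    day = datetime.strptime(index_name[len(prefix):], "%Y.%m.%d").date()
    age = (today - day).days
    if age > DELETE_AFTER_DAYS:
        return "delete"
    if age > CLOSE_AFTER_DAYS:
        return "close"
    if age > MERGE_AFTER_DAYS:
        return "forcemerge"
    return "keep"


today = date(2015, 10, 10)
for name in ("logstash-2015.10.09", "logstash-2015.10.01",
             "logstash-2015.09.01", "logstash-2015.06.01"):
    print(name, plan_action(name, today))
```

Checking the oldest threshold first means each index gets exactly one action per run, and an index past 90 days is deleted rather than closed or merged first.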

  • Monitoring

    Marvel - Official plugin from Elasticsearch

    KOPF - Index management plugin

    CAT APIs - REST APIs to view cluster information

    Curator - Data management

  • Thanks

    email: angad@viki.com

    twitter: @angadsg
