Scaling an ELK stack at bol.com
Post on 25-Jan-2015
Description: A presentation about the deployment of an ELK stack at bol.com. At bol.com we use Elasticsearch, Logstash and Kibana in a logsearch system that allows our developers and operations people to easily access and search through log events coming from all layers of our infrastructure. The presentation explains the initial design and its failures. It continues with the latest design (mid 2014) and its improvements. Finally, a set of tips is given regarding Logstash and Elasticsearch scaling. These slides were first presented at the Elasticsearch NL meetup on September 22nd 2014 at the Utrecht bol.com HQ.
1. Scaling an ELK stack
Elasticsearch NL meetup
2014.09.22, Utrecht

2. Who am I?
Renzo Tom
- IT operations
- Linux engineer
- Python developer
- Likes huge streams of raw data
- Designed metrics & logsearch platform
- Married, proud father of two
And you?

3. ELK

4. ELK at bol.com
Logsearch platform.
For developers & operations.
Search & analyze log events using Kibana.
Events from many sources (e.g. syslog, accesslog, log4j, ...)
Part of our infrastructure.
Why? Faster root cause analysis => quicker time-to-repair.

5. Real world examples
Case: release of new webshop version.
Nagios alert: jboss processing time.
Metrics: increase in active threads (and proctime).
=> Inconclusive!
Find all HTTP requests to www.bol.com which were slower than 5 seconds:
@type:apache_access AND @fields.site:www_bol_com AND @fields.responsetimes:[5.000.000 TO *]
=> Hits for 1 URL. Enough for DEV to start its RCA.

6. Real world examples
Case: strange performance spikes on webshop.
Looks bad, but cause unknown.
Find all errors in webshop log4j logging:
@fields.application:wsp AND @fields.level:ERROR
Compare errors before vs during spike. Spot the difference.
=> Spikes caused by timeouts on a backend service.
Metrics correlation: timeouts not cause, but symptom of full GC issue.

7. Initial design (mid 2013ish)
[Architecture diagram. Labels: Kibana 2; Elasticsearch; Logstash; central syslog server; servers, routers, firewalls (syslog); Apache webservers (accesslog, tail, remote_syslog pkg); Java webapplications (JVM) (Log4j syslogappender); log events.]
Logstash acts as syslog server. Converts lines into events, into json docs.
Using syslog protocol over UDP as transport. Even for accesslog + log4j.

8. Initial attempt #fail
Single logstash instance not fast enough.
Unable to keep up with events created.
High CPU load, due to intensive grokking (regex).
Network buffer overflow. UDP traffic dropped.
Result: missing events.

9. Initial attempt #fail
Log4j events can be multiline (e.g. stacktraces).
Events are sent per line: 100 lines = 100 syslog msgs.
Merging by Logstash.
Remember the UDP drops?
Result:
- unparseable events (if 1st line was missing)
- Swiss cheese.
Stacktrace lines were missing.

10. Initial attempt #fail
Syslog RFC3164: "The total length of the packet MUST be 1024 bytes or less."
Rich Apache LogFormat + lots of cookies = 4kb easily.
Anything after byte 1024 got trimmed.
Result: unparseable events (mismatch grok pattern).

11. The only way is up.
Improvement proposals:
- Use queuing to make Logstash horizontally scalable.
- Drop syslog as transport (for non-syslog).
- Reduce amount of grokking. Pre-formatting at source scales better. Less complexity.

12. Latest design (mid 2014ish)
[Architecture diagram. Labels: Kibana 2 + 3; Elasticsearch; Logstash (many instances); Redis (queue); central syslog server; servers, routers, firewalls (syslog); Apache webservers (accesslog in jsonevent format, local Logsheep); Java webapplications (JVM) (Log4j jsoneventlayout, Log4j redisappender, local Logsheep); lots of other sources; log events.]
Events in jsonevent format. No grokking required.

13. Current status #win
- Logstash: up to 10 instances per env (because of logstash 1.1 version)
- ES cluster (v1.0.1): 6 data + 2 client nodes
- Each datanode has 7 datadisks (striping)
- Indexing at 2k-4k docs added per second
- Avg. index time: 0.5ms
- Peak: 300M docs = 185GB, per day
- Searches: just a few per hour
- Shardcount: 3 per idx, 1 replica, 3000 total
- Retention: up to 60 days

14. Our lessons learned
Before anything else!
Start collecting metrics so you get a baseline.
No blind tuning. Validate every change fact-based.
Our weapons of choice:
- Graphite
- Diamond (I am contributor of the ES collector)
- Jcollectd
Alternative: try Marvel.

15. Logstash tip #1
Insert Redis as queue between source and logstash instances:
- Scale Logstash horizontally
- High availability (no events get lost)
[Diagram: Redis and Logstash instances]

16. Logstash tip #2
Tune your workers. Find your chokepoint and increase its workers to improve throughput.
[Diagram: Input, Filter and Output pipeline stages]
$ top -H -p $(pgrep logstash)

17. Logstash tip #3
Grok is very powerful, but CPU intensive.
Hard to write, maintain and debug.
Fix: vertical scaling. Increase filterworkers or add more Logstash instances.
Better: feed Logstash with jsonevent input.
Solutions:
- Log4j: use log4j-jsonevent-layout
- Apache: define json output with LogFormat

18. Logstash tip #4 (last one)
Use the HTTP protocol Elasticsearch output.
Avoid a version lock-in!
HTTP may be slower, but newer ES means:
- Lots of new features
- Lots of bug fixes
- Lots of performance improvements
Most important: you decide what versions to use.
Logstash v1.4.2 (June '14) requires ES v1.1.1 (April '14).
Latest ES version is v1.3.2 (Aug '14).

19. Elasticsearch tip #1
Do not download a "great" configuration.
Elasticsearch is very complex. Lots of moving parts. Lots of different use-cases. Lots of configuration options. The defaults can not be optimal.
Start with defaults:
- Load it (stresstest or pre-launch traffic).
- Check your metrics.
- Find your chokepoint.
- Change setting.
- Verify and repeat.

20. Elasticsearch tip #2
Increase the index.refresh_interval setting.
Refresh: make newly added docs available for search. Default value: one second. High impact on heavy indexing systems (like ours).
Change it at runtime & check the metrics:
$ curl -s -XPUT 0:9200/_all/_settings?index.refresh_interval=5s

21. Elasticsearch tip #3
Use Curator to keep total shardcount constant.
Uncontrolled shard growth may trigger a sudden hockey stick effect.
Our setup:
- 6 datanodes
- 6 shards per index
- 3 primary, 3 replica
One shard per datanode (YMMV)

22. Elasticsearch tip #4
Become experienced in rolling cluster restarts:
- to roll out new Elasticsearch releases
- to apply a config setting (e.g. heap, gc, ..)
- because it will solve an incident.
Control concurrency + bandwidth:
cluster.routing.allocation.node_concurrent_recoveries
cluster.routing.allocation.cluster_concurrent_rebalance
indices.recovery.max_bytes_per_sec
Get confident enough to trust doing a rolling restart on a Saturday evening!
(To get this graph ...)
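
The three settings named in Elasticsearch tip #4 can be changed at runtime through the cluster settings API. Below is a minimal Python sketch of the request body, assuming an ES 1.x-era cluster as described in the slides. The setting names come from the slide; the values, the host and both helper function names are placeholders chosen for illustration, not the values used at bol.com. The command is only printed, not executed. Newer Elasticsearch versions also expect a Content-Type header on such requests.

```python
import json

# Setting names are taken from the slide. The values are assumptions
# picked for illustration; tune them against your own metrics.
RECOVERY_THROTTLE = {
    "cluster.routing.allocation.node_concurrent_recoveries": 2,
    "cluster.routing.allocation.cluster_concurrent_rebalance": 2,
    "indices.recovery.max_bytes_per_sec": "40mb",
}


def cluster_settings_body(settings, persistent=False):
    """Wrap settings in the JSON body used by PUT /_cluster/settings.

    Transient settings are lost on a full cluster restart,
    persistent settings survive it.
    """
    scope = "persistent" if persistent else "transient"
    return json.dumps({scope: dict(settings)}, sort_keys=True)


def curl_command(body, host="localhost:9200"):
    """Return (but do not run) a curl command that would apply the body."""
    return "curl -s -XPUT '%s/_cluster/settings' -d '%s'" % (host, body)


if __name__ == "__main__":
    print(curl_command(cluster_settings_body(RECOVERY_THROTTLE)))
```

Applying the settings as transient keeps the experiment cheap: check the metrics, and if recovery saturates the disks or network, lower the values and try again.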

23. Elasticsearch tip #5 (last one)
Cluster restarts improve recovery time.
Recovery: compares replica vs primary shard. If different, recreate the replica. Costly (iowait) and very time consuming.
But difference is normal. Primary and replica have their own segment merge management: same docs, but different bytes.
After recovery: replica is exact copy of primary.
Note: only works for stale shards (no more updates). You have a lot of those when using daily Logstash indices.

24. You can contact me via: email@example.com, or

26. Relocation in action

27. Tools we use
http://redis.io/
Key/value memory store, no-frills queuing, extremely fast. Used to scale logstash horizontally.
https://github.com/emicklei/log4j-redis-appender
Send log4j event to Redis queue, non-blocking, batch, failover.
https://github.com/emicklei/log4j-jsonevent-layout
Format log4j events in logstash event layout. Why have logstash do lots of grokking, if you can feed it with logstash friendly json?
http://untergeek.com/2013/09/11/getting-apache-to-output-json-for-logstash-1-2-x/
Format Apache access logging in logstash event layout. Again: avoid grokking.
https://github.com/bolcom/ (SOON)
Logsheep: custom multi-threaded logtailer / udp listener, sends events to redis.
https://github.com/BrightcoveOS/Diamond/
Great metrics collector framework with Elasticsearch collector. I am contributor.
https://github.com/elasticsearch/curator
Tool for automatic Elasticsearch index management (delete, close, optimize, bloom).
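
The "avoid grokking" idea behind the JSON layout tools listed above can be sketched in a few lines of Python: the source emits a structured JSON event, a queue buffers it, and the consumer only has to parse JSON instead of matching regexes. This is an illustration under assumptions, not the layout these tools actually produce. The field names @type, @fields.site and @fields.responsetimes are taken from the query on slide 5; the url field, the microsecond unit, the function names and the in-memory deque standing in for a Redis list are all hypothetical.

```python
import json
from collections import deque
from datetime import datetime, timezone

# Stand-in for a Redis list (assumption): producers append on the right,
# consumers such as Logstash would pop from the left.
queue = deque()


def access_event(site, url, response_time_us, when=None):
    """Build an access-log event as a JSON string at the source."""
    when = when or datetime.now(timezone.utc)
    event = {
        "@timestamp": when.isoformat(),
        "@type": "apache_access",  # as used in the slide 5 query
        "@fields": {
            "site": site,
            "url": url,  # hypothetical field
            "responsetimes": response_time_us,  # assumed to be microseconds
        },
    }
    return json.dumps(event, sort_keys=True)


def consume(q):
    """Pop the oldest event and parse it; no regex matching involved."""
    return json.loads(q.popleft())


if __name__ == "__main__":
    queue.append(access_event("www_bol_com", "/nl/index.html", 5200000))
    event = consume(queue)
    print(event["@type"], event["@fields"]["responsetimes"])
```

Because the event is already structured when it enters the queue, adding consumers only costs JSON parsing, which is the reason pre-formatting at the source scales better than grokking in Logstash.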