Scaling ELK Stack - DevOpsDays Singapore


Post on 17-Feb-2017


TRANSCRIPT

<ul><li><p>ELK: Log processing at Scale</p><p>#DevOpsDays 2015, Singapore @DevOpsDaysSG</p><p>Angad Singh</p></li><li><p>About me</p><p>DevOps at Viki, Inc - a global video streaming site with subtitles.</p><p>Previously a Twitter SRE; National University of Singapore.</p><p>Twitter @angadsg, Github @angad</p></li><li><p>Elasticsearch - log indexing and searching</p><p>Logstash - log ingestion plumbing</p><p>Kibana - frontend</p></li><li><p>Metrics vs Logging</p><p>Metrics: numeric timeseries data; actionable; counts and statistical aggregates (p90, p99 etc.); scalable, cost-effective solutions already available.</p><p>Logging: useful for debugging; catch-all; full-text searching; computationally intensive and harder to scale.</p></li><li><p>Alerting and Monitoring at Viki</p><p>Deeper-level debugging with application logs</p><p>Success Rate Alert for service X</p></li><li><p>Logs</p><p>Application logs - stack traces, handled exceptions</p><p>Access logs - status codes, URI, HTTP method at all levels of the stack</p><p>Client logs - direct HTTP requests containing log events from client-side Javascript or mobile applications (Android/iOS)</p><p>Standardized log format to JSON - easy to add / remove fields.</p><p>Request tracing through the various services using a Unique-ID assigned at the load balancer</p></li><li><p>Log aggregator</p><p>Log preprocessing (filtering etc.) 
3-stage pipeline: Input &gt; Filter &gt; Output</p><p>Logstash</p></li><li><p>Elasticsearch</p><p>Full-text searching and indexing on top of Apache Lucene</p><p>RESTful web interface</p><p>Horizontally scalable</p></li><li><p>Kibana</p><p>Frontend for visualizations and dashboards</p><p>Supports geo visualizations</p><p>Uses the ES REST API</p></li><li><p>Logstash</p><p>Input - any stream: local file, queue, tcp, udp, twitter etc.</p><p>Filter - mutation: add/remove field, parse as JSON, run ruby code, parse geoip etc.</p><p>Output - elasticsearch, redis queue, file, pagerduty etc.</p></li><li><p>Golang program that sits next to log files; speaks the lumberjack protocol</p><p>Forwards logs from a file to a logstash server</p><p>Removes the need for a buffer (such as redis, or a queue) for logs pending ingestion into logstash</p><p>Docker container with volume-mounted /var/log. 
</p><p>Configuration stored in Consul.</p><p>Application containers volume-mount /var/log to /var/log/docker//application.log</p><p>Logstash Forwarder</p></li><li><p>Logstash pool with HAProxy</p><p>4 x logstash machines: 8 cores, 16 GB RAM</p><p>7 x logstash processes per machine: 5 for application logs, 2 for HTTP client logs</p><p>Fronted by HAProxy for both the lumberjack protocol and HTTP</p><p>Easily scalable by adding more machines and spinning up more logstash processes</p></li><li><p>[Diagram: Application Service Containers 1 and 2 alongside a Logstash-Forwarder container, with /var/log volume-mounted to /var/log/docker/ on the host, shipping through HAProxy to Elasticsearch]</p></li><li><p>Elasticsearch Hardware</p><p>12 cores, 64GB RAM, RAID 0 across 2 x 3TB 7200rpm disks</p><p>20 nodes, 20 shards, 3 replicas (plus 1 primary)</p><p>~300GB per day x 4 copies (3 replicas + 1 primary); roughly 3 months of data on 120TB</p><p>Average 6k-8k logs per second, peaking at 25k logs per second</p><p>https://www.elastic.co/guide/en/elasticsearch/guide/current/hardware.html</p></li><li><p>&lt; 30.5 GB heap - Java compressed object pointers only work while the heap stays below ~30.5GB</p><p>Sweet spot: 64GB of RAM with half left available for Lucene file buffers</p><p>SSD or RAID 0 (or multiple data path directories, similar to RAID 0)</p><p>If SSD, set the I/O scheduler to deadline instead of cfq. 
RAID 0 - no need to worry about disks failing, since machines can easily be replaced thanks to the multiple copies of the data</p><p>Disable swap.</p><p>Hardware Tuning</p></li><li><p>20 days of indexes kept open (based on available memory); the rest closed and opened on demand</p><p>Field data - cache used while sorting and aggregating data</p><p>Circuit breaker - cancels requests that would require large amounts of memory, preventing OOM; hit http://elasticsearch:9200/_cache/clear if field data gets very close to the memory limit</p><p>Shards &gt;= number of nodes</p><p>Lucene forceMerge - minor performance improvement for older indexes (https://www.elastic.co/guide/en/elasticsearch/client/curator/current/optimize.html)</p><p>Elasticsearch Configuration</p></li><li><p>Prevent the split-brain situation (and the resulting data loss) - set the minimum number of master-eligible nodes to (n/2 + 1)</p><p>Set a higher ulimit for the elasticsearch process</p><p>Daily cronjob deletes data older than 90 days, closes indices older than 20 days, and optimizes (forceMerge) indices older than 2 days</p><p>And also...</p></li><li><p>Marvel - official plugin from Elastic</p><p>KOPF - index management plugin</p><p>CAT APIs - REST APIs to view cluster information</p><p>Curator - data management</p><p>Monitoring</p></li><li><p>Thanks</p><p>email: angad@viki.com</p><p>twitter: @angadsg</p></li></ul>
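The 3-stage Input &gt; Filter &gt; Output pipeline from the slides might be sketched as a logstash config like the following. This is a minimal illustration, not the talk's actual config: the lumberjack port, certificate paths, and the `clientip` field name are placeholder assumptions, and the option names follow the logstash 1.x plugins current at the time of the talk.

```
input {
  lumberjack {
    port            => 5043                       # placeholder port
    ssl_certificate => "/etc/pki/logstash.crt"    # placeholder paths
    ssl_key         => "/etc/pki/logstash.key"
  }
}

filter {
  json  { source => "message" }    # events are standardized JSON
  geoip { source => "clientip" }   # "parse geoip"; field name is an assumption
}

output {
  elasticsearch { host => "elasticsearch" }   # logstash 1.x option; newer versions use `hosts`
}
```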
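On the shipping side, a logstash-forwarder configuration matching the setup above (a forwarder container tailing the volume-mounted log files and sending them to the logstash pool over lumberjack) could look roughly like this. The server address, CA path, glob pattern, and `type` field are all assumptions for illustration:

```
{
  "network": {
    "servers": [ "haproxy.internal:5043" ],
    "ssl ca": "/etc/pki/logstash-ca.crt",
    "timeout": 15
  },
  "files": [
    {
      "paths": [ "/var/log/docker/*/application.log" ],
      "fields": { "type": "application" }
    }
  ]
}
```

In the talk's setup this file would be rendered from the configuration stored in Consul rather than written by hand.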
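The HAProxy front for the logstash pool handles two different protocols: lumberjack, which HAProxy can only pass through as raw TCP, and plain HTTP for the client logs. A hedged sketch, with all addresses and ports invented for illustration:

```
# Lumberjack is TLS-over-TCP end to end, so HAProxy runs in mode tcp here.
frontend lumberjack_in
    bind *:5043
    mode tcp
    default_backend logstash_lumberjack

frontend http_logs_in
    bind *:8080
    mode http
    default_backend logstash_http

backend logstash_lumberjack
    mode tcp
    balance roundrobin
    server logstash1 10.0.0.1:5043 check
    server logstash2 10.0.0.2:5043 check

backend logstash_http
    mode http
    balance roundrobin
    server logstash1 10.0.0.1:8080 check
    server logstash2 10.0.0.2:8080 check
```

Scaling out is then just adding more `server` lines as logstash processes are spun up.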
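The split-brain and memory advice maps to a handful of settings in `elasticsearch.yml` and the environment. A sketch assuming the 20-node cluster described above, using ES 1.x-era setting names; values are derived from the slides, not copied from the talk's actual config:

```
# elasticsearch.yml sketch (ES 1.x-era setting names)
discovery.zen.minimum_master_nodes: 11   # n/2 + 1 with n = 20 master-eligible nodes
bootstrap.mlockall: true                 # lock the heap in RAM; complements disabling swap
index.number_of_shards: 20               # shards >= number of nodes
index.number_of_replicas: 3              # 4 copies total (3 replicas + 1 primary)

# environment: keep the heap under the compressed-pointers cutoff
# ES_HEAP_SIZE=30g
```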
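The daily retention cronjob can be expressed directly against the ES 1.x REST API (DELETE an index, `_close` it, or `_optimize` it down to fewer segments), which is what Curator wraps. A hypothetical crontab fragment; the host, the `logstash-%Y.%m.%d` index naming, and the schedule times are assumptions (`%` must be escaped in crontab entries):

```
# delete indices older than 90 days
0 2 * * *  curl -s -XDELETE "http://elasticsearch:9200/logstash-$(date -d '90 days ago' +\%Y.\%m.\%d)"
# close indices older than 20 days
5 2 * * *  curl -s -XPOST "http://elasticsearch:9200/logstash-$(date -d '20 days ago' +\%Y.\%m.\%d)/_close"
# forceMerge indices older than 2 days
10 2 * * * curl -s -XPOST "http://elasticsearch:9200/logstash-$(date -d '2 days ago' +\%Y.\%m.\%d)/_optimize?max_num_segments=1"
```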
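The capacity figures in the hardware slide are consistent with each other, which is worth checking back-of-the-envelope (using the slide's round numbers, with 1 TB taken as 1000 GB):

```python
# Back-of-the-envelope check of the capacity figures from the slides.
daily_gb = 300          # ~300GB of logs indexed per day
copies = 4              # 1 primary + 3 replicas
retention_days = 90     # ~3 months of data kept
nodes = 20
disk_per_node_tb = 6    # RAID 0 over 2 x 3TB disks

total_needed_tb = daily_gb * copies * retention_days / 1000
total_capacity_tb = nodes * disk_per_node_tb

print(total_needed_tb)    # ~108 TB needed for 90 days
print(total_capacity_tb)  # 120 TB of raw disk across the cluster
```

So 90 days of quadruplicated data needs about 108 TB, which fits (with little headroom) on the 120 TB the 20 nodes provide.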