elk stack at weibo.com

real-time log search & analysis [email protected]


Post on 16-Apr-2017


TRANSCRIPT

Page 1: ELK stack at weibo.com

real-time log search & analysis

[email protected]

Page 2: ELK stack at weibo.com

about me

• Perler, SA @ weibo.com, renren.com, china.com...

• Writer of 《网站运维技术与实践》 (Website Operations: Technology and Practice)

• Translator of 《Puppet 3 Cookbook》

• weibo account : @ARGV

Page 3: ELK stack at weibo.com

agenda

• ELKstack situation

• ELKstack usecase

• from ELK to ERK

• performance tuning of LERK

Page 4: ELK stack at weibo.com

ERK situation

• datanode * 26:

• 2.4Ghz*8, 42G, 300G *10 RAID5

• logtype * 25 , 7days , 65 billion events , 60k fields

• size 8TB /day , indexing 190k eps

• rsyslog/logstash * 10

• custom plugins of rsyslog/logstash/kibana

• user : qa team, app/server dev team, are team

• ops : ME*0.8

Page 5: ELK stack at weibo.com

kopf

stats monitor & setting modify

Page 6: ELK stack at weibo.com

bigdesk

real-time node stats

Page 7: ELK stack at weibo.com

zabbix trapper

monitor and alert KPI of ELK

Page 8: ELK stack at weibo.com

But, Why ELK ?

Page 9: ELK stack at weibo.com

First, what can log do?

• Identify problem

• data-driven develop/test/operate

• audit

• Laws of Marcus J. Ranum

• Monitor

• Monitoring is the aggregation of health and performance data, events, and relationships delivered via an interface that provides an holistic view of a system's state to better understand and address failure scenarios. @etsy

Page 10: ELK stack at weibo.com

difficulties of LA(1)

• timestamp + data = log

• OK, what happened between 23:12 and 23:29 yesterday?

Page 11: ELK stack at weibo.com

difficulties of LA(2)

•text is un-structured data

Page 12: ELK stack at weibo.com

difficulties of LA(2)

•grep/awk only run at single host

Page 13: ELK stack at weibo.com

difficulties of LA(3)

• complex formats make the data hard to visualize

Page 14: ELK stack at weibo.com

So...

• We need a real-time big-data search platform.

• But, splunk is expensive.

• So, spell OSS pls.

Page 15: ELK stack at weibo.com

ELKstack Beginner

Page 16: ELK stack at weibo.com

Hello World

# bin/logstash -e 'input{stdin{}}output{stdout{codec=>rubydebug}}'

Hello World

{

"message" => "Hello World",

"@version" => "1",

"@timestamp" => "2014-08-07T10:30:59.937Z",

"host" => "raochenlindeMacBook-Air.local",

}

Page 17: ELK stack at weibo.com

How Powerful

• $ ./bin/logstash -e 'input{generator{count=>100000000000}}output{stdout{codec=>dots}}' | pv -abt > /dev/null

•15.1MiB 0:02:21 [ 112kiB/s]

Page 18: ELK stack at weibo.com

How scaling

Page 19: ELK stack at weibo.com

Talk is cheap, show me the case!

Page 20: ELK stack at weibo.com

application log by php

Page 21: ELK stack at weibo.com

logstash.conf

Page 22: ELK stack at weibo.com

Kibana3

backend devs and ops use it to identify errors in APIs and apps

Page 23: ELK stack at weibo.com

and Kibana4

OK, K4 still needs a prettier color scheme for now

Page 24: ELK stack at weibo.com

PHP slowlog

Page 25: ELK stack at weibo.com

after multiline codec

ops use it to check PHP slow-function stacks across IDCs and hosts

Page 26: ELK stack at weibo.com

drill-down one host

Page 27: ELK stack at weibo.com

Nginx errorlog

Page 28: ELK stack at weibo.com

grok {
  match => { "message" => "(?<datetime>\d{4}/\d\d/\d\d \d\d:\d\d:\d\d) \[(?<errtype>\w+)\] \S+: \*\d+ (?<errmsg>[^,]+), (?<errinfo>.*)$" }
}
mutate {
  gsub => [ "errmsg", "too large body: \d+ bytes", "too large body" ]
}
if [errinfo] {
  ruby {
    code => "event.append(Hash[event['errinfo'].split(', ').map{|l| l.split(': ')}])"
  }
}
grok {
  match => { "request" => '"%{WORD:verb} %{URIPATH:urlpath}(?:\?%{NGX_URIPARAM:urlparam})?(?: HTTP/%{NUMBER:httpversion})"' }
}
kv {
  prefix => "url_"
  source => "urlparam"
  field_split => "&"
}
date {
  locale => 'en'
  match => [ "datetime", "yyyy/MM/dd HH:mm:ss" ]
}
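The first grok expression and the ruby filter on this slide can be tried outside Logstash. The following is a minimal Ruby sketch, not the production filter: the sample log line and the helper name `parse_nginx_error` are invented for illustration, and the split uses a limit of 2 (a small deviation from the slide) so that values containing ": " stay intact.

```ruby
# Sample line invented for illustration; it follows the layout the slide's regex expects.
LINE = '2015/04/03 12:00:01 [error] 1234#0: *5678 upstream timed out, ' \
       'client: 10.0.0.1, server: api.weibo.cn, request: "GET /2/statuses?from=105 HTTP/1.1"'

# Same expression as the first grok filter on the slide.
NGX_ERROR = %r{(?<datetime>\d{4}/\d\d/\d\d \d\d:\d\d:\d\d) \[(?<errtype>\w+)\] \S+: \*\d+ (?<errmsg>[^,]+), (?<errinfo>.*)$}

# Hypothetical helper: returns a Hash of fields, or nil when the line does not match.
def parse_nginx_error(line)
  m = NGX_ERROR.match(line)
  return nil unless m

  event = { 'datetime' => m[:datetime], 'errtype' => m[:errtype], 'errmsg' => m[:errmsg] }
  # Counterpart of the ruby filter: "key: value, key: value" becomes extra fields.
  m[:errinfo].split(', ').each do |pair|
    key, value = pair.split(': ', 2)
    event[key] = value
  end
  event
end

event = parse_nginx_error(LINE)
puts event.inspect
```

The `request` field extracted here is what the second grok filter and the kv filter then break down into verb, urlpath and url_* parameters.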

Page 29: ELK stack at weibo.com

performance tuning and troubleshooting based on multi dimensions reports

Page 30: ELK stack at weibo.com

different top lists in another time range

Page 31: ELK stack at weibo.com

app crash

app devs focus on crash stacks from which system functions have been filtered out.

Page 32: ELK stack at weibo.com

New release, Ad-hoc filter, Focus crash

Page 33: ELK stack at weibo.com

Query helper for QA and NOC, decrease MTTI for complaints

Page 34: ELK stack at weibo.com

H5 devs focus on the performance timeline of index.html

Page 35: ELK stack at weibo.com

probability distribution of response time

no more average, no more guess
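Why the distribution beats the average can be shown with a few numbers. This is a minimal Ruby sketch using the nearest-rank percentile method; the response times are made up, and `percentile` is a hypothetical helper, not something from Kibana or Elasticsearch.

```ruby
# Nearest-rank percentile over a list of response times (seconds).
# pct is an integer between 1 and 100.
def percentile(values, pct)
  raise ArgumentError, 'values must not be empty' if values.empty?

  sorted = values.sort
  rank = (pct * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up sample: most requests are fast, two are very slow.
times = [0.05, 0.06, 0.05, 0.07, 0.06, 0.05, 0.08, 0.06, 2.5, 3.0]
mean = times.sum / times.length

puts format('mean=%.3f p50=%.2f p90=%.2f', mean, percentile(times, 50), percentile(times, 90))
```

In this sample the mean is roughly ten times the median, so it describes neither the typical request nor the slow tail, while the percentiles show both.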

Page 36: ELK stack at weibo.com

from ELK to ERK

Page 37: ELK stack at weibo.com

someone's children😲

Page 38: ELK stack at weibo.com

My Poor Child😄

Page 39: ELK stack at weibo.com

WHY?

Page 40: ELK stack at weibo.com

compare

            logstash                     rsyslog
Design      multithreads + SizedQueue    multithreads + mainQ
Lang        JRuby                        C
Syntax      DSL                          rainerscript
ENV         jre1.7                       within rhel6
Queue       rely on external system      async queue
regexp      ruby                         ERE
output      java to ES                   HTTP to ES
plugin      182                          57
monitor     NO!                          pstats

Page 41: ELK stack at weibo.com

problems of Logstash

• poor performance of input/syslog; use input/tcp + filter/grok instead

• poor performance of filter/geoip; we developed filter/geoip2

• high CPU cost of filter/grok; use filter/ruby and split the line yourself

• OOM in input/tcp (prior to 1.4.2)

• OOM in output/elasticsearch (prior to 1.5.0)

• retry in output/elasticsearch duplicates the retry of SizedQueue in stud (as of now)

Page 42: ELK stack at weibo.com

problem of LogStash(1)

• LogStash::Inputs::Syslog

• logstash pipeline :

• input thread -> filterworker threads * Num -> output thread

• But what's in Inputs::Syslog :

• TCPServer/accept -> client thread -> filter/grok -> filter/date -> filterworker threads

• So grok and date are forced to run in a single thread!

• A pure TCPServer can process 50k qps, but only 6k after filter/grok, and then 700 after filter/date!

Page 43: ELK stack at weibo.com

problem of LogStash(1)

• LogStash::Inputs::Syslog

• Solution:

input {
  tcp { port => 514 }
}
filter {
  grok { match => ["message", "%{SYSLOGLINE}"] }
  syslog_pri { }
  date { match => ["timestamp", "ISO8601"] }
}

• 30k eps in `logstash -w 20` testing.

Page 44: ELK stack at weibo.com

problem of LogStash(2)

• LogStash::Filters::Grok

• What's Grok:

• pre-defined patterns : NUMBER \d+

• use %{NUMBER:score} instead of (?<score>\d+)

• regexp costs LOTS of CPU.
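The relation between a grok pattern and a plain named-group regexp can be sketched in a few lines of Ruby. This is an illustration only, not grok's real implementation: `PATTERNS` holds a tiny hypothetical subset of the pattern library (with NUMBER simplified to `\d+` as on the slide), and `expand_grok` is an invented helper name.

```ruby
# A tiny, simplified subset of a grok-style pattern library (assumption for illustration).
PATTERNS = { 'NUMBER' => '\d+', 'WORD' => '\w+' }.freeze

# Expand %{NAME:field} into a named capture group and %{NAME} into a plain group.
def expand_grok(expression)
  expression.gsub(/%\{(\w+)(?::(\w+))?\}/) do
    name = Regexp.last_match(1)
    field = Regexp.last_match(2)
    body = PATTERNS.fetch(name)
    field ? "(?<#{field}>#{body})" : "(?:#{body})"
  end
end

source = expand_grok('score=%{NUMBER:score} user=%{WORD:user}')
match = Regexp.new(source).match('score=42 user=alice')
puts source
puts match[:score]
```

Whatever the syntax, each event still pays for a full regexp match, which is where the CPU goes.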

Page 45: ELK stack at weibo.com

problem of LogStash(2)

• LogStash::Filters::Grok

• solution:

• avoid grok if you can define a separator for your log format:

filter {
  ruby {
    init => "@kname = ['datetime','uid','limittype','limitkey','client','clientip','request_time','url']"
    code => "event.append(Hash[@kname.zip(event['message'].split('|'))])"
  }
  mutate { convert => ["request_time", "float"] }
}

• Result: CPU utilization dropped by about 20%
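The zip-and-split trick from the ruby filter works the same way in plain Ruby. A minimal sketch, assuming a pipe-separated line with the eight fields named on the slide; the sample line and the helper name `parse_delimited` are invented.

```ruby
# Field names in the order they appear in the log line (taken from the slide).
KNAME = %w[datetime uid limittype limitkey client clientip request_time url].freeze

# Hypothetical helper: split on '|' and pair each value with its field name.
def parse_delimited(message)
  event = Hash[KNAME.zip(message.split('|'))]
  # Counterpart of mutate/convert: request_time becomes a float.
  event['request_time'] = event['request_time'].to_f
  event
end

# Invented sample line for illustration.
line = '2015-04-03 12:00:01|1001|ip|10.0.0.1|android|10.0.0.1|0.54|/2/statuses'
event = parse_delimited(line)
puts event.inspect
```

No regexp engine is involved here, only a split and a zip, which fits the CPU saving reported on the slide.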

Page 46: ELK stack at weibo.com

problem of LogStash(3)

• LogStash::Filters::GeoIP

• 7k eps, even if `logstash -w 30`

• The new MaxMindDB format has a great performance improvement. But LogStash can't distribute it for some license reason.

Page 47: ELK stack at weibo.com

problem of LogStash(3)

• LogStash::Filters::GeoIP

• solution:

• use MaxMind::DB::Writer, change the internal ip.db into ip.mmdb, 300MB->50MB

• JRuby can java_import maxminddb-java.

• 28k eps with LogStash::Filters::MaxMindDB

Page 48: ELK stack at weibo.com

problem of LogStash(4)

• LogStash::Outputs::Elasticsearch

• 3 bugs so far :

1. OOM in logstash1.4.2 (ftw-0.0.39)

2. the retry by Manticore (logstash1.5.0beta1) duplicated the retry by stud in the pipeline, which could cause an infinite loop of resending

3. logstash1.5.0rc1 can't record the 429 code; who knows what "got response of . source:" means?

• 1 and 3 were solved in the newest logstash1.5.0rc3.

Page 49: ELK stack at weibo.com

problem of LogStash(5)

• LogStash::Pipeline

• no supervisor for filterworkers. If every filter worker dies from an exception, logstash blocks but the process stays alive!

• If you use filter/ruby to reference `event['field']` as I introduced before, check the field first!

if [url] {
  ruby { code => "event['urlpath']=event['url'].split('?')[0]" }
}
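The failure mode behind this advice is easy to reproduce in plain Ruby: calling `split` on a missing field raises, and an unhandled exception is what kills a filter worker. A minimal sketch with plain Hashes standing in for events; both helper names are invented.

```ruby
# Unguarded version: raises NoMethodError when the event has no 'url' field,
# because event['url'] is nil.
def urlpath_unguarded(event)
  event['url'].split('?')[0]
end

# Guarded version, the counterpart of wrapping the ruby filter in `if [url] { ... }`.
def add_urlpath(event)
  url = event['url']
  event['urlpath'] = url.split('?')[0] if url.is_a?(String)
  event
end

puts add_urlpath('url' => '/2/statuses?from=105').inspect
puts add_urlpath('message' => 'no url here').inspect
```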

Page 50: ELK stack at weibo.com

problem of LogStash(6)

• LogStash::Pipeline

• prior to logstash1.5.0, a new event created by `yield` did not go through the remaining filters; it went straight to the output thread.

• yield is used in filter-split and filter-clone

Page 51: ELK stack at weibo.com

Rsyslog tuning

• action with linkedlist

• imfile with an appropriate persiststateinterval (avoid too many duplicates after restart)

• omfwd with a small rebindinterval (when the target is behind LVS)

• an appropriate global.maxmessagesize

• an appropriate queue.size and queue.highwatermark

• CEE log format is recommended, used together with mmjsonparse

• separator-based log formats can be processed with mmfields

• make the best use of rainerscript

• concat JSON strings with the property replacer

• developed rsyslog-mmdblookup for IP lookup

Page 52: ELK stack at weibo.com

problem of rsyslog(1)

• I found an experimental `foreach` in rsyslog8.7, great! But when I processed my JSON array logs from apps, there were 3 bugs:

1. foreach doesn't check the type of its parameters;

2. action() doesn't copy the msg but passes a reference. If you omfwd each item in foreach, it crashes... The test-suite only uses omfile, which is synchronous.

3. omelasticsearch has an uninitialized variable when the errorfile option is enabled.

There will be a new copymsg option for action() in rsyslog8.10, which is supposed to be released on May 20.

Page 53: ELK stack at weibo.com

problem of rsyslog(2)

• Not so many message modification plugins.

• mmexternal could fork too many subprocesses in v8 (but not in v7). And the processing speed is only 2k eps!

• We have finished a new rsyslog-mmdblookup plugin, which will run in the production env on May 15.

Page 54: ELK stack at weibo.com

input( type="imtcp" port="514" )
template( name="clientlog" type="list" ) {
    constant(value="{\"@timestamp\":\"")
    property(name="timereported" dateFormat="rfc3339")
    constant(value="\",\"host\":\"")
    property(name="hostname")
    constant(value="\",\"mmdb\":")
    property(name="!iplocation")
    constant(value=",")
    property(name="$.line" position.from="2")
}
ruleset( name="clientlog" ) {
    action(type="mmjsonparse")
    if($parsesuccess == "OK") then {
        foreach ($.line in $!msgarray) {
            if($.line!rtt == "-") then {
                set $.line!rtt = 0;
            }
            set $.line!urlpath = field($.line!url, 63, 1);
            set $.line!urlargs = field($.line!url, 63, 2);
            set $.line!from = "";
            if ( $.line!urlargs != "***FIELD NOT FOUND***" ) then {
                reset $.line!from = re_extract($.line!urlargs, "from=([0-9]+)", 0, 1, "");
            } else {
                unset $.line!urlargs;
            }
            action(type="mmdb" key=".line!clientip" fields=["city","isp","country"] mmdbfile="./ip.mmdb")
            action(type="omelasticsearch" server="1.1.1.1" bulkmode="on" template="clientlog" queue.size="10000" queue.dequeuebatchsize="2000" )
        }
    }
}
if ($programname startswith "mweibo_client") then {
    call clientlog
    stop
}

Page 55: ELK stack at weibo.com

ES tuning

• DO NOT believe the articles online!!

• DO test with your own dataset; start from one node, one index, one shard, zero replicas.

• use unicast with a bigger fd.ping_timeout

• doc_values, doc_values, doc_values!!!

• increase the settings of gateway, recovery and allocation

• increase refresh_interval and flush_threshold_size

• increase store.throttle.max_bytes_per_sec

• upgrade to 1.5.1 at least

• scale: use max_shards_per_node

• use bulk! no multithreaded client, no async

• use curator for _optimize

• no _all for fixed-format logs

Page 56: ELK stack at weibo.com

problem of ES(1)

• OOM:

• Kibana3 use facet_filter, which means lots of hits in QUERY phase.

• There is a circuit breaker in newer versions, so you may see errors like the following:

Data too large, data for field [@timestamp] would be larger than limit of[639015321/609.4mb]]

Page 57: ELK stack at weibo.com

problem of ES(1)

• OOM:

• solution:

• doc_values,doc_values,doc_values!

• No more heap needed, 31GB is enough.

Page 58: ELK stack at weibo.com

ES stability problem (2)

• very long downtime during relocation and recovery.

• default strategy:

• recovery starts immediately after restart

• only one shard is relocated at a time

• recovery throughput is limited to 20MB/s

• a replica needs to copy all files from the primary shard!

Page 59: ELK stack at weibo.com

ES stability problem (2)

• very long downtime during relocation and recovery.

• solution:

• gateway.*: recovery after cluster has enough nodes

• cluster.routing.allocation.*: larger concurrent

• indices.recovery.*: larger limit

• red to yellow: 20 min for full restart.

• Note: there is a bug that may cause the recovery process to block in the translog phase (prior to 1.5.1).

Page 60: ELK stack at weibo.com

problem of ES(3)

• new nodes die.

• default strategy of shard allocation:

• try to balance the total shards number per node.

• no new shard if over 90% disk.

• On the second day after scaling out, all new shards are allocated to the new node! That means it takes all the indexing load.

Page 61: ELK stack at weibo.com

ES stability problem (3)

• new nodes die.

• solution:

1. finish relocation before the creation of next new index.

2. set index.routing.allocation.total_shards_per_node

• note1: please set a slightly larger value, to leave room for recovery after a fault...

• note2: DO NOT apply this to old indices; your new node is busy now.
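Picking a value for index.routing.allocation.total_shards_per_node is simple arithmetic. A minimal Ruby sketch, assuming the figures that appear elsewhere in this deck (26 primary shards, 1 replica, 26 data nodes); the helper names and the headroom of 1 are illustrative assumptions, not a rule from Elasticsearch.

```ruby
# Smallest per-node limit that still lets every shard copy of one index be allocated.
def min_shards_per_node(primaries, replicas, data_nodes)
  copies = primaries * (1 + replicas)
  (copies + data_nodes - 1) / data_nodes # integer ceiling division
end

# "A little larger" than the minimum, so shards can still move when a node fails.
def suggested_limit(primaries, replicas, data_nodes, headroom = 1)
  min_shards_per_node(primaries, replicas, data_nodes) + headroom
end

puts min_shards_per_node(26, 1, 26)
puts suggested_limit(26, 1, 26)
```

With these inputs the minimum is 2 and the suggestion is 3. Setting the limit exactly at the minimum leaves no room to reallocate shards once a node is lost, which is the situation note1 warns about.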

Page 62: ELK stack at weibo.com

problem of ES(4)

• async replica

• CPU util% rises violently if one segment has some deviation; async does NOT validate the indexing data.

• ES is going to remove the async parameter.

Page 63: ELK stack at weibo.com

ES performance(1)

• 429, 429, 429...

• the length of one "client_net_fatal_error" logline may be larger than 1MB.

• the max HTTP body of ES is 100MB. Be careful with bulk_size.
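One way to stay under the HTTP body limit is to cap bulk requests by bytes rather than by event count. A minimal Ruby sketch of that idea: `bulk_bodies` is an invented helper, the index and type names are placeholders, and the code only builds newline-delimited bulk bodies without sending anything.

```ruby
require 'json'

# Group documents into bulk request bodies that each stay under max_bytes.
# A single document larger than max_bytes still gets a body of its own.
def bulk_bodies(docs, index, type, max_bytes)
  bodies = []
  current = String.new
  docs.each do |doc|
    action = JSON.generate('index' => { '_index' => index, '_type' => type })
    entry = action + "\n" + JSON.generate(doc) + "\n"
    if !current.empty? && current.bytesize + entry.bytesize > max_bytes
      bodies << current
      current = String.new
    end
    current << entry
  end
  bodies << current unless current.empty?
  bodies
end

# Made-up documents; the 500-byte limit is tiny on purpose so the split is visible.
docs = Array.new(10) { |i| { 'msg' => 'x' * 100, 'n' => i } }
bodies = bulk_bodies(docs, 'logstash-test', 'log', 500)
puts bodies.length
```

With loglines that can exceed 1MB, a count-based bulk_size can produce bodies of very different sizes, so a byte cap is the more predictable knob.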

Page 64: ELK stack at weibo.com

ES performance(2)

• index size is several times larger than raw message size.

• _source: raw JSON

• _all: terms in every fields, for full text searching

• multi-field: .raw for all fields in logstash template

• So:

• no _all for nginx accesslog.

• no _source for metrics tsdb log.

• not_analyzed for most fields, analyzed only for the raw message.

Page 65: ELK stack at weibo.com

ES performance(3)

• CPU util% is always high because of segment merging (hot threads forever).

• max segment: 5GB

• min segment: 2MB

• increase: refresh_interval (default 1s) and flush_threshold_size (default 200MB).

Page 66: ELK stack at weibo.com

cluster.name: es1003
cluster.routing.allocation.node_initial_primaries_recoveries: 30
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 5
cluster.routing.allocation.enable: all
node.name: esnode001
node.master: false
node.data: data
node.max_local_storage_nodes: 1
index.routing.allocation.total_shards_per_node : 3
index.merge.scheduler.max_thread_count: 1
index.refresh_interval: 30s
index.number_of_shards: 26
index.number_of_replicas: 1
index.translog.flush_threshold_size : 5000mb
index.translog.flush_threshold_ops: 50000
index.search.slowlog.threshold.query.warn: 30s
index.search.slowlog.threshold.fetch.warn: 1s
index.indexing.slowlog.threshold.index.warn: 10s
indices.store.throttle.max_bytes_per_sec: 1000mb
indices.cache.filter.size: 10%
indices.fielddata.cache.size: 10%
indices.recovery.max_bytes_per_sec: 2gb
indices.recovery.concurrent_streams: 30
path.data: /data1/elasticsearch/data
path.logs: /data1/elasticsearch/logs
bootstrap.mlockall: true
http.max_content_length: 400mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"
gateway.type: local
gateway.recover_after_nodes: 30
gateway.recover_after_time: 5m
gateway.expected_nodes: 30
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 100s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.19.0.97","10.19.0.98","10.19.0.99"]
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s

Page 67: ELK stack at weibo.com

problem of ES(1)

• different result in search and store:

curl es.domain.com:9200/logstash-accesslog-2015.04.03/nginx/_search?q=_id:AUx-QvSBS-dhpiB8_1f1\&pretty -d '{
  "fields": ["requestTime"],
  "script_fields" : {
    "test1" : { "script" : "doc[\"requestTime\"].value" },
    "test2" : { "script" : "_source.requestTime" },
    "test3" : { "script" : "doc[\"requestTime\"].value * 1000" }
  }
}'

Page 68: ELK stack at weibo.com

NOT schema free!

"hits" : {
  "total" : 1,
  "max_score" : 1.0,
  "hits" : [ {
    "_index" : "logstash-accesslog-2015.04.03",
    "_type" : "nginx",
    "_id" : "AUx-QvSBS-dhpiB8_1f1",
    "_score" : 1.0,
    "fields" : {
      "test1" : [ 4603039107142836552 ],
      "test3" : [ -8646911284551352000 ],
      "requestTime" : [ 0.54 ],
      "test2" : [ 0.54 ]
    }
  } ]
}
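The strange test1 value is not random. It equals the IEEE 754 bit pattern of the double 0.54 read as a 64-bit integer, which suggests (as an interpretation, not something the slide states) that the indexed value was read back under a different numeric type than the one in _source. A minimal Ruby sketch of the conversion; both helper names are invented.

```ruby
# Reinterpret the 8 bytes of a double as an unsigned 64-bit integer (big-endian).
def double_bits(value)
  [value].pack('G').unpack('Q>').first
end

# The reverse: reinterpret a 64-bit integer as a double.
def bits_to_double(bits)
  [bits].pack('Q>').unpack('G').first
end

puts double_bits(0.54)                    # → 4603039107142836552
puts bits_to_double(4603039107142836552)  # → 0.54
```

The _source value (test2) is unaffected because it is just the original JSON, which is why the two scripts disagree.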

Page 69: ELK stack at weibo.com

problem of ES(2)

• some data can't be found!

• ES needs the same field name to have the same mapping type within the same _type of the same index.

• My "client_net_fatal_error" log data changed after one release:

• {"reqhdr":{"Host":"api.weibo.cn"}}

• {"reqhdr":"{\"Host\":\"api.weibo.cn\"}"}

• Set the mapping of the "reqhdr" object to {"enabled":false}. The string can then only be seen in the _source JSON, but not searched.

Page 70: ELK stack at weibo.com

problem of ES(3)

• some data can't be found! Again!

• There is a default setting `ignore_above:256` in the logstash template.

curl 10.19.0.100:9200/logstash-mweibo-2015.05.18/mweibo_client_crash/_search?q=_id:AU1ltyTCQC8tD04iYBIe\&pretty -d '{

"fielddata_fields" : ["jsoncontent.content", "jsoncontent.platform"], "fields" : ["jsoncontent.content","jsoncontent.platform"]}'... "fields" : { "jsoncontent.content" : [ "dalvik.system.NativeStart.main(Native Method)\nCaused by: java.lang.ClassNotFoundException: Didn't find class \"com.sina.weibo.hc.tracking.manager.TrackingService\" on path: DexPathList[[zip file \"/data/app/com.sina.weibo-1.apk\", zip file \"/data/data/com.sina.weibo/code_cache/secondary-dexes/com.sina.weibo-1.apk.classes2.zip\", zip file \"/data/data/com.sina.weibo/app_dex/dbcf1705b9ffbc30ec98d1a76ada120909.jar\"],nativeLibraryDirectories=[/data/app-lib/com.sina.weibo-1, /vendor/lib, /system/lib]]" ],

"jsoncontent.platform" : [ "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI", "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI" ]

}

Page 71: ELK stack at weibo.com

kibana custom development

• upgraded the elastic.js version in K3 to support the API of ES1.2. Then we can use the aggs API to implement new panels (percentile panel, range panel, and cardinality histogram panel).

• "export as csv" for table panel.

• map provider setting for bettermap.

• term_stats for map.

• china map.

• query helper.

• script field for terms panel.

• OR filtering.

• more in <https://github.com/chenryn/kibana>

Page 72: ELK stack at weibo.com

see also

• 《Elasticsearch Server (2nd edition)》

• 《Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management》

• 《Data Analysis with Open Source Tools》

• 《Web Operations: Keeping the Data on Time》

• 《The Art of Capacity Planning》

• 《大规模 Web 服务开发技术》

• https://codeascraft.com/

•http://calendar.perfplanet.com

•http://kibana.logstash.es

Page 73: ELK stack at weibo.com

[email protected]

“If a newbie has a bad time, it's a bug.”