elk stack at weibo.com
TRANSCRIPT
real-time log search & analysis
about me
• Perler, SA @ weibo.com, renren.com, china.com...
• Author of 《网站运维技术与实践》 (Website Operations: Technology and Practice)
• Translator of 《Puppet 3 Cookbook》
• weibo account: @ARGV
agenda
• ELK stack situation
• ELK stack use cases
• from ELK to ERK
• performance tuning of ERK
ERK situation
• datanode * 26:
  • 2.4GHz * 8 cores, 42GB RAM, 300GB * 10 RAID5
• logtype * 25, 7 days retention, 65 billion events, 60k fields
• size 8TB/day, indexing 190k eps
• rsyslog/logstash * 10
• custom plugins for rsyslog/logstash/kibana
• users: QA team, app/server dev team, are team
• ops: ME * 0.8
kopf: stats monitoring & settings modification
bigdesk: real-time node stats
zabbix trapper: monitoring and alerting on ELK KPIs
But, Why ELK ?
First, what can log do?
• Identify problems
• Data-driven development/testing/operations
• Audit
• Laws of Marcus J. Ranum
• Monitor
• "Monitoring is the aggregation of health and performance data, events, and relationships delivered via an interface that provides a holistic view of a system's state to better understand and address failure scenarios." @etsy
difficulties of LA(1)
• timestamp + data = log
• OK, so what happened between 23:12 and 23:29 yesterday?
difficulties of LA(2)
• text is unstructured data
difficulties of LA(3)
• grep/awk only run on a single host
difficulties of LA(4)
• complex formats are inconvenient to visualize
So...
• We need a real-time big-data search platform.
• But splunk is expensive.
• So, spell OSS, please.
ELKstack Beginner
Hello World
# bin/logstash -e 'input{stdin{}}output{stdout{codec=>rubydebug}}'
Hello World
{
"message" => "Hello World",
"@version" => "1",
"@timestamp" => "2014-08-07T10:30:59.937Z",
"host" => "raochenlindeMacBook-Air.local"
}
How Powerful
• $ ./bin/logstash -e 'input{generator{count=>100000000000}}output{stdout{codec=>dots}}' | pv -abt > /dev/null
• 15.1MiB 0:02:21 [112kiB/s] (the dots codec prints one byte per event, so this is roughly 112k events/s)
How it scales
Talk is cheap, show me the case!
application log by php
logstash.conf
Kibana3: backend devs and ops use it to identify errors in APIs and apps
and Kibana4: OK, K4 needs a prettier color scheme by now
PHP slowlog
after the multiline codec, ops use it to check PHP slow function stacks across IDCs and hosts
drill down into one host
Nginx errorlog
grok {
  match => { "message" => "(?<datetime>\d{4}/\d\d/\d\d \d\d:\d\d:\d\d) \[(?<errtype>\w+)\] \S+: \*\d+ (?<errmsg>[^,]+), (?<errinfo>.*)$" }
}
mutate {
  gsub => [ "errmsg", "too large body: \d+ bytes", "too large body" ]
}
if [errinfo] {
  ruby {
    code => "event.append(Hash[event['errinfo'].split(', ').map{|l| l.split(': ')}])"
  }
}
grok {
  match => { "request" => '"%{WORD:verb} %{URIPATH:urlpath}(?:\?%{NGX_URIPARAM:urlparam})?(?: HTTP/%{NUMBER:httpversion})"' }
}
kv {
  prefix => "url_"
  source => "urlparam"
  field_split => "&"
}
date {
  locale => 'en'
  match => [ "datetime", "yyyy/MM/dd HH:mm:ss" ]
}
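The `ruby` filter in this config relies on a plain string-splitting trick. As a standalone sketch (the sample error line below is illustrative, not from production), it works like this:

```ruby
# How the ruby filter turns nginx's trailing "key: value, key: value"
# error info into separate event fields.
errinfo = 'client: 1.2.3.4, server: example.com, request: "GET / HTTP/1.1"'

# split on ", " then on ": " -- the same expression used in the config
fields = Hash[errinfo.split(', ').map { |l| l.split(': ') }]

puts fields['client']   # "1.2.3.4"
puts fields['server']   # "example.com"
```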
performance tuning and troubleshooting based on multi-dimension reports
compare the top-N lists across different time ranges
app crash
app devs focus on crash stacks; system functions are filtered out.
New release, ad-hoc filter, focus on crashes
Query helper for QA and NOC, decreases MTTI for complaints
H5 devs focus on the performance timeline of index.html
probability distribution of response time
no more averages, no more guessing
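The point about averages can be made concrete with a toy sample (numbers invented for illustration): the mean hides the slow tail that a high percentile exposes.

```ruby
# Why percentiles beat averages for response times: two slow requests
# barely move the mean but dominate the 99th percentile.
times = [0.01] * 98 + [5.0, 9.0]             # seconds; toy data

mean = times.sum / times.size                # ~0.15s -- looks healthy
sorted = times.sort
p99 = sorted[(sorted.size * 0.99).ceil - 1]  # 5.0s -- the real pain

puts format('mean=%.2fs p99=%.2fs', mean, p99)
```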
from ELK to ERK
someone's children😲
My Poor Child😄
WHY?
compare
            Logstash                      rsyslog
• Design  : multithreads + SizedQueue     multithreads + mainQ
• Lang    : JRuby                         C
• Syntax  : DSL                           rainerscript
• ENV     : jre1.7                        within rhel6
• Queue   : rely on external system       async queue
• regexp  : ruby                          ERE
• output  : java to ES                    HTTP to ES
• plugins : 182                           57
• monitor : NO!                           pstats
problems of Logstash
• poor performance of Input/syslog: use input/tcp + filter/grok instead
• poor performance of Filter/geoip: developed filter/geoip2 instead
• high CPU cost of Filter/grok: use filter/ruby with split instead
• OOM in Input/tcp (prior to 1.4.2)
• OOM in Output/elasticsearch (prior to 1.5.0)
• retry in Output/elasticsearch duplicates the SizedQueue handling in stud (as of now)
problem of LogStash(1)
• LogStash::Inputs::Syslog
• the logstash pipeline:
  • input thread -> filterworker threads * Num -> output thread
• but what's inside Inputs::Syslog:
  • TCPServer/accept -> client thread -> filter/grok -> filter/date -> filterworker threads
• so grok and date run in only one thread!
• A pure TCPServer can process 50k eps, but only 6k after filter/grok, and then 700 after filter/date!
problem of LogStash(1)
• LogStash::Inputs::Syslog
• Solution:
input {
  tcp { port => 514 }
}
filter {
  grok { match => ["message", "%{SYSLOGLINE}"] }
  syslog_pri { }
  date { match => ["timestamp", "ISO8601"] }
}
• 30k eps in `logstash -w 20` testing.
problem of LogStash(2)
• LogStash::Filters::Grok
• What's grok:
• pre-defined patterns: e.g. NUMBER for \d+; write %{NUMBER:score} instead of (?<score>\d+)
• regexps cost LOTS of CPU.
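A grok pattern like %{NUMBER:score} is just sugar for a named capture group, which plain Ruby can demonstrate (the pattern below is simplified versus grok's real NUMBER definition, and the sample line is invented):

```ruby
# %{NUMBER:score} expands into a named capture like (?<score>...) before
# matching; the regex engine does the real work either way.
NUMBER = /[+-]?\d+(?:\.\d+)?/        # simplified vs. grok's NUMBER
line = 'GET /rank?uid=42 score=98.6'

m = /score=(?<score>#{NUMBER})/.match(line)
puts m[:score]                       # "98.6"
```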
problem of LogStash(2)
• LogStash::Filters::Grok
• solution:
• avoid grok if you can define a separator for your log format:
filter {
  ruby {
    init => "@kname = ['datetime','uid','limittype','limitkey','client','clientip','request_time','url']"
    code => "event.append(Hash[@kname.zip(event['message'].split('|'))])"
  }
  mutate {
    convert => ["request_time", "float"]
  }
}
• Result: CPU utilization reduced by about 20%
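The zip/split trick from the config, standalone: one String#split on a fixed separator replaces an entire grok regex. Field names are from the slide; the sample message is invented:

```ruby
# Pair the known field names with the pipe-separated values, exactly as
# the filter/ruby code block does with event['message'].
kname = %w[datetime uid limittype limitkey client clientip request_time url]
message = '2015-05-20 12:00:00|12345|api|key1|ios|1.2.3.4|0.032|/2/users/show'

fields = Hash[kname.zip(message.split('|'))]
puts fields['request_time']   # "0.032"
```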
problem of LogStash(3)
• LogStash::Filters::GeoIP
• 7k eps, even with `logstash -w 30`
• The new MaxMindDB format has a great performance improvement, but LogStash can't distribute it for licensing reasons.
problem of LogStash(3)
• LogStash::Filters::GeoIP
• solution:
• use MaxMind::DB::Writer to convert the internal ip.db into ip.mmdb, 300MB -> 50MB
• JRuby can java_import maxminddb-java.
• 28k eps with LogStash::Filters::MaxMindDB
problem of LogStash(4)
• LogStash::Outputs::Elasticsearch
• 3 bugs as of now:
1. OOM in logstash 1.4.2 (ftw-0.0.39)
2. retry via Manticore (logstash 1.5.0beta1) duplicated the stud retry in the pipeline, which could cause an infinite resend loop
3. logstash 1.5.0rc1 can't record the 429 code; who knows what "got response of . source:" means?
• 1 and 3 were solved in the newest logstash 1.5.0rc3.
problem of LogStash(5)
• LogStash::Pipeline
• no supervisor for filterworkers: if all filterworkers hit exceptions, logstash blocks but stays alive!
• If you reference `event['field']` in filter/ruby as introduced before, check that the field exists first:
if [url] {
  ruby {
    code => "event['urlpath'] = event['url'].split('?')[0]"
  }
}
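The guard matters because calling `split` on a missing field raises NoMethodError on nil, which is exactly the kind of exception that kills a filterworker. A minimal reproduction (event hashes invented):

```ruby
# Without the `if [url]` guard, a missing field crashes the worker:
event = { 'message' => 'no url here' }   # 'url' absent

# event['url'].split('?') would raise NoMethodError on nil, so check first
urlpath = event['url'] ? event['url'].split('?')[0] : nil
puts urlpath.inspect    # nil -- worker survives

event2 = { 'url' => '/2/users/show?uid=42' }
urlpath2 = event2['url'] ? event2['url'].split('?')[0] : nil
puts urlpath2           # "/2/users/show"
```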
problem of LogStash(6)
• LogStash::Pipeline
• a new event created via `yield` should go through the remaining filters, but instead goes straight to the output thread (prior to logstash 1.5.0).
• yield is used in filter-split and filter-clone
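A rough sketch of the yield mechanism (assumed behavior, not logstash's actual classes): filter-split yields one new event per line of a multiline message, and in old versions those yielded events bypassed the remaining filters.

```ruby
# Sketch of filter-split: one multiline message yields one event per line.
# In logstash prior to 1.5.0, these yielded events skipped the remaining
# filters and went straight to the output thread.
def split_filter(event)
  event['message'].split("\n").each do |line|
    yield('message' => line)
  end
end

out = []
split_filter('message' => "line1\nline2\nline3") { |e| out << e }
puts out.size   # 3
```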
Rsyslog tuning
• action with linkedlist
• imfile with an appropriate PersistStateInterval (avoid too much duplication after restart)
• omfwd with a small RebindInterval (when the target sits behind LVS)
• an appropriate global maxMessageSize
• an appropriate queue.size and queue.highwatermark
• recommended CEE log format, used together with mmjsonparse
• separator-based log formats can be processed with mmfields
• make the best use of rainerscript
• concat JSON strings with the property replacer
• developed a rsyslog-mmdblookup plugin for IP lookups
problem of rsyslog(1)
• I found an experimental `foreach` in rsyslog 8.7, great! But when I processed my JSON array logs from apps, there were 3 bugs:
1. foreach doesn't check the type of its parameters;
2. action() doesn't copy the msg, only a reference. If you omfwd each item inside a foreach: crash... The test suite only uses omfile, which is synchronous.
3. omelasticsearch has an uninitialized variable when the errorfile option is enabled.
• A new copymsg option for action() will arrive in rsyslog 8.10, supposed to be published on May 20.
problem of rsyslog(2)
• Not so many message modification plugins.
• mmexternal can fork too many subprocesses in v8 (but not in v7), and the processing speed is only 2k eps!
• We have finished a new rsyslog-mmdblookup plugin; it will run in the production env on May 15.
input( type="imtcp" port="514" )
template( name="clientlog" type="list" ) {
  constant(value="{\"@timestamp\":\"")
  property(name="timereported" dateFormat="rfc3339")
  constant(value="\",\"host\":\"")
  property(name="hostname")
  constant(value="\",\"mmdb\":")
  property(name="!iplocation")
  constant(value=",")
  property(name="$.line" position.from="2")
}
ruleset( name="clientlog" ) {
  action(type="mmjsonparse")
  if ($parsesuccess == "OK") then {
    foreach ($.line in $!msgarray) {
      if ($.line!rtt == "-") then {
        set $.line!rtt = 0;
      }
      set $.line!urlpath = field($.line!url, 63, 1);
      set $.line!urlargs = field($.line!url, 63, 2);
      set $.line!from = "";
      if ( $.line!urlargs != "***FIELD NOT FOUND***" ) then {
        reset $.line!from = re_extract($.line!urlargs, "from=([0-9]+)", 0, 1, "");
      } else {
        unset $.line!urlargs;
      }
      action(type="mmdb" key=".line!clientip" fields=["city","isp","country"] mmdbfile="./ip.mmdb")
      action(type="omelasticsearch" server="1.1.1.1" bulkmode="on" template="clientlog"
             queue.size="10000" queue.dequeuebatchsize="2000")
    }
  }
}
if ($programname startswith "mweibo_client") then {
  call clientlog
  stop
}
ES tuning
• DO NOT believe the articles online!!
• DO test with your own dataset; start from one node, one index, one shard, zero replicas.
• use unicast with a bigger fd.ping_timeout
• doc_values, doc_values, doc_values!!!
• increase the gateway, recovery, and allocation settings
• increase refresh_interval and flush_threshold_size
• increase store.throttle.max_bytes_per_sec
• upgrade to 1.5.1 at least
• scale: use max_shards_per_node
• use bulk! no multithreaded clients, no async
• use curator for _optimize
• no _all for fixed-format logs
problem of ES(1)
• OOM:
• Kibana3 uses facet_filter, which means lots of hits in the QUERY phase.
• There is a circuit breaker in newer versions, so you may see errors like:
Data too large, data for field [@timestamp] would be larger than limit of [639015321/609.4mb]
problem of ES(1)• OOM:
• solution:
• doc_values,doc_values,doc_values!
• No more heap needed, 31GB is enough.
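A minimal mapping fragment (field name from the error above, the rest illustrative) showing how to turn on doc_values in an ES 1.x template so fielddata lives on disk instead of heap:

```
{
  "mappings": {
    "_default_": {
      "properties": {
        "@timestamp": { "type": "date", "doc_values": true }
      }
    }
  }
}
```

Note that in ES 1.x, doc_values only applies to not_analyzed string, numeric, and date fields.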
ES stability problem (2)
• long, long downtime during relocation and recovery.
• default strategy:
• recovery starts immediately after restart
• only one shard relocates at a time
• limited to 20MB/s
• a replica needs to copy all files from the primary shard!
ES stability problem (2)
• long, long downtime during relocation and recovery.
• solution:
• gateway.*: recover only after the cluster has enough nodes
• cluster.routing.allocation.*: larger concurrency
• indices.recovery.*: larger limits
• red to yellow: 20 min for a full restart.
• Note: a bug may cause the recovery process to block in the translog phase (prior to 1.5.1).
problem of ES(3)
• new nodes die.
• default shard allocation strategy:
• try to balance the total shard count per node.
• no new shards if the disk is over 90% full.
• On the second day after scaling, all new shards get allocated to the new node! That means all the indexing load, too.
problem of ES(3)
• new nodes die.
• solution:
1. finish relocation before the next new index is created.
2. set index.routing.allocation.total_shards_per_node
• note1: set a slightly larger value, in case recovery is needed after a fault...
• note2: DO NOT apply this to old indices; your new node is busy enough already.
problem of ES(4)• async replica
• CPU util% rises violently if one segment has some deviation; async does NOT validate the indexed data.
• ES will remove the async replication parameter.
ES performance(1)
• 429, 429, 429...
• the length of one "client_net_fatal_error" log line may be larger than 1MB.
• the default max HTTP body ES accepts is 100MB. Be careful with bulk_size.
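Since the limits above are enforced in bytes rather than events, a client-side sketch (a hypothetical helper, not an ES or logstash API) that caps bulk bodies by size instead of count:

```ruby
# Group serialized events into bulk bodies that stay under a byte budget,
# so a few oversized loglines can't push a whole bulk past the HTTP limit.
def bulk_batches(lines, max_bytes)
  batches, current, size = [], [], 0
  lines.each do |line|
    if !current.empty? && size + line.bytesize > max_bytes
      batches << current
      current = []
      size = 0
    end
    current << line
    size += line.bytesize
  end
  batches << current unless current.empty?
  batches
end

batches = bulk_batches(['x' * 60] * 5, 130)
puts batches.map(&:size).inspect   # [2, 2, 1]
```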
ES performance(2)
• index size is several times larger than the raw message size.
• _source: the raw JSON
• _all: terms from every field, for full-text search
• multi-field: a .raw sub-field for every field in the logstash template
• So:
• no _all for nginx accesslogs.
• no _source for metrics tsdb logs.
• no analyzed fields for most fields; only the raw message stays analyzed.
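A template fragment (template pattern illustrative) applying the _all advice to accesslog indices:

```
{
  "template": "logstash-nginx-*",
  "mappings": {
    "_default_": {
      "_all": { "enabled": false }
    }
  }
}
```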
ES performance(3)
• constant CPU utilization from segment merges (hot threads forever).
• max segment: 5GB
• min segment: 2MB
• increase refresh_interval (default 1s) and flush_threshold_size (default 200MB).
cluster.name: es1003
cluster.routing.allocation.node_initial_primaries_recoveries: 30
cluster.routing.allocation.node_concurrent_recoveries: 5
cluster.routing.allocation.cluster_concurrent_rebalance: 5
cluster.routing.allocation.enable: all
node.name: esnode001
node.master: false
node.data: true
node.max_local_storage_nodes: 1
index.routing.allocation.total_shards_per_node: 3
index.merge.scheduler.max_thread_count: 1
index.refresh_interval: 30s
index.number_of_shards: 26
index.number_of_replicas: 1
index.translog.flush_threshold_size: 5000mb
index.translog.flush_threshold_ops: 50000
index.search.slowlog.threshold.query.warn: 30s
index.search.slowlog.threshold.fetch.warn: 1s
index.indexing.slowlog.threshold.index.warn: 10s
indices.store.throttle.max_bytes_per_sec: 1000mb
indices.cache.filter.size: 10%
indices.fielddata.cache.size: 10%
indices.recovery.max_bytes_per_sec: 2gb
indices.recovery.concurrent_streams: 30
path.data: /data1/elasticsearch/data
path.logs: /data1/elasticsearch/logs
bootstrap.mlockall: true
http.max_content_length: 400mb
http.enabled: true
http.cors.enabled: true
http.cors.allow-origin: "*"
gateway.type: local
gateway.recover_after_nodes: 30
gateway.recover_after_time: 5m
gateway.expected_nodes: 30
discovery.zen.minimum_master_nodes: 3
discovery.zen.ping.timeout: 100s
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["10.19.0.97","10.19.0.98","10.19.0.99"]
monitor.jvm.gc.young.warn: 1000ms
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.debug: 2s
ES mapping problem (1)
• different result in search and store:
curl es.domain.com:9200/logstash-accesslog-2015.04.03/nginx/_search?q=_id:AUx-QvSBS-dhpiB8_1f1\&pretty -d '{
  "fields": ["requestTime"],
  "script_fields" : {
    "test1" : { "script" : "doc[\"requestTime\"].value" },
    "test2" : { "script" : "_source.requestTime" },
    "test3" : { "script" : "doc[\"requestTime\"].value * 1000" }
  }
}'
NOT schema free!
"hits" : {
  "total" : 1,
  "max_score" : 1.0,
  "hits" : [ {
    "_index" : "logstash-accesslog-2015.04.03",
    "_type" : "nginx",
    "_id" : "AUx-QvSBS-dhpiB8_1f1",
    "_score" : 1.0,
    "fields" : {
      "test1" : [ 4603039107142836552 ],
      "test3" : [ -8646911284551352000 ],
      "requestTime" : [ 0.54 ],
      "test2" : [ 0.54 ]
    }
  } ]
}
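The weird test1 value is deterministic, not random corruption: presumably the field got mapped as a long first, so the doc-values path reads 0.54's raw IEEE-754 bits back as a 64-bit integer. Plain Ruby reproduces the exact number:

```ruby
# 0.54's double bit pattern reinterpreted as a signed 64-bit integer is
# exactly the value doc["requestTime"].value returned in the search above.
bits = [0.54].pack('G').unpack('q>').first  # big-endian double -> int64
puts bits                                   # 4603039107142836552
```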
ES mapping problem (2)
• some data can't be found!
• ES needs the same mapping type for the same field name within the same _type of the same index.
• My "client_net_fatal_error" log data changed after one release:
• {"reqhdr":{"Host":"api.weibo.cn"}}
• {"reqhdr":"{\"Host\":\"api.weibo.cn\"}"}
• Set the mapping of the "reqhdr" object to {"enabled":false}: the string can then only be viewed in the _source JSON, not searched.
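The workaround as a mapping fragment (ES 1.x syntax, surrounding names omitted):

```
{
  "properties": {
    "reqhdr": { "type": "object", "enabled": false }
  }
}
```

With enabled:false, ES stores the field in _source but neither parses nor indexes it, so the type conflict between the object form and the string form disappears.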
ES mapping problem (3)
• some data can't be found! Again!
• There was a default setting `ignore_above: 256` in the logstash template.
curl 10.19.0.100:9200/logstash-mweibo-2015.05.18/mweibo_client_crash/_search?q=_id:AU1ltyTCQC8tD04iYBIe\&pretty -d '{
  "fielddata_fields" : ["jsoncontent.content", "jsoncontent.platform"],
  "fields" : ["jsoncontent.content", "jsoncontent.platform"]
}'
...
"fields" : {
  "jsoncontent.content" : [ "dalvik.system.NativeStart.main(Native Method)\nCaused by: java.lang.ClassNotFoundException: Didn't find class \"com.sina.weibo.hc.tracking.manager.TrackingService\" on path: DexPathList[[zip file \"/data/app/com.sina.weibo-1.apk\", zip file \"/data/data/com.sina.weibo/code_cache/secondary-dexes/com.sina.weibo-1.apk.classes2.zip\", zip file \"/data/data/com.sina.weibo/app_dex/dbcf1705b9ffbc30ec98d1a76ada120909.jar\"],nativeLibraryDirectories=[/data/app-lib/com.sina.weibo-1, /vendor/lib, /system/lib]]" ],
  "jsoncontent.platform" : [ "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI", "Android_4.4.4_MX4 Pro_Weibo_5.3.0 Beta_WIFI" ]
}
kibana custom development
• upgraded the elastic.js version in K3 to support the ES 1.2 API, so we can use the aggs API to implement new panels (percentile panel, range panel, and cardinality histogram panel).
• "export as csv" for table panel.
• map provider setting for bettermap.
• term_stats for map.
• china map.
• query helper.
• script field for terms panel.
• OR filtering.
• more in <https://github.com/chenryn/kibana>
see also
•《Elasticsearch Server (2nd edition)》
•《Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management》
•《Data Analysis with Open Source Tools》
•《Web Operations: Keeping the Data on Time》
•《The Art of Capacity Planning》
•《大规模 Web 服务开发技术》 (Large-Scale Web Service Development)
• https://codeascraft.com/
• http://calendar.perfplanet.com
• http://kibana.logstash.es
“If a newbie has a bad time, it's a bug.”