Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Post on 15-Jan-2015


DESCRIPTION

Tips and tricks for using and scaling Graphite. First presented at DevOpsDays Austin, Texas, 2013-05-01

TRANSCRIPT

Nick Galbreath http://client9.com/20130501 @ngalbreath

Care and Feeding of Large Scale Graphite Installations

Nick Galbreath ★ IPONWEB

DevOpsDays ★ Austin Texas ★ 2013-04-30


http://client9.com/20130501


Who is nickg?


(that's online advertising infrastructure)


• Over One Billion Points collected daily.

• In "many" independent Graphite clusters.

so, graphite?

Who cares?

• Making it easy to create, analyze and share data can change your organization

• Making a data-driven culture

• Empowering developers, operations, qa, security and business to be more confident in the changes they make.


What is it you say you do here?

• Your job is likely invisible to the rest of the organization

• invisible things aren't valued

• so make what you do visible


Why Graphite?

• Many innovations in each part of the stack

• But, it's the Full Stack that really makes it special.

• On-disk layout to UI to API to... the community around it.


Sharing is Caring

• Allows data to be easily accessed

• And easily shared. This makes it different than many monitoring solutions.

• It's your own in-house mashup generator.

What is it?


The Big Picture

• It's a database

• But not ACID

• And without all the database tools

Installation

• 3 python-twistd servers

• "carbon-cache"

• "carbon-aggregator"

• "carbon-relay"

• Apache / Django Web UI and API

• Uses SQLite3/MySQL for dashboards / events


Be Current

• Don't use the OS default version.

• Newer point releases of graphite have significant improvements in storage engine and webui/api

• It's 100% Python so "building it yourself" shouldn't be too hard.

• pip install works and is current.


The Documentation

• everyone complains about it

• historically bad, but getting a lot better

• Switched locations, but not all search engines are updated to use: http://graphite.readthedocs.org/

• Source code is quite good, so RTFS


Storage Engine

What is it?

• Storage engine

• Handles reads, writes and creates of a single metric to a fixed size file.

• One file, kinda dumb (good).

• Here's the API: https://github.com/graphite-project/whisper/blob/master/whisper.py

Graphite Math

• About 12 bytes per point.

• Store 1 minute points for 1 month and 15 minutes for 11 months.

• (60×24×30 + 4×24×30×11) ×12 = 878kB

• If you can keep all your points in memory, then magic!
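
The arithmetic on this slide can be checked with a short sketch. It assumes a 30-day month, uses the 12 bytes per point quoted above, and ignores the small per-file header, so treat the result as an estimate. The function name is made up for illustration.

```python
BYTES_PER_POINT = 12  # figure quoted on the slide
DAY = 24 * 60 * 60    # seconds in a day

def whisper_data_bytes(archives):
    """Estimate point storage for (seconds_per_point, retention_seconds) archives."""
    points = sum(retention // step for step, retention in archives)
    return points * BYTES_PER_POINT

archives = [
    (60, 30 * DAY),            # 1-minute points for 1 month (assumed 30 days)
    (15 * 60, 11 * 30 * DAY),  # 15-minute points for 11 months
]
size = whisper_data_bytes(archives)
print(size, "bytes, roughly", round(size / 1024), "KiB per metric")
```

Multiplying the per-metric figure by the number of metrics gives a rough idea of whether the working set fits in memory.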

Disk Layout

• Each metric creates a directory tree: server123.myapp.logins.failed

• Makes 3 directories

• This creates a very branchy directory structure

• This has good and bad points.
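
A minimal sketch of how a dotted metric name maps onto the directory tree. The storage root shown is an assumption (a common default install location), and the helper name is hypothetical; this is not whisper's own code.

```python
import posixpath

def metric_to_path(metric, root="/opt/graphite/storage/whisper"):
    """Map a dotted metric name to a .wsp file path under root (assumed location)."""
    parts = metric.split(".")
    # Every component except the last becomes a directory; the last is the file.
    return posixpath.join(root, *parts[:-1], parts[-1] + ".wsp")

print(metric_to_path("server123.myapp.logins.failed"))
```

For the metric on the slide this gives three directories (server123/myapp/logins) and one file, failed.wsp.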


Middleware

carbon-cache

Part 1


Carbon Cache

• Metrics go in

• csv file or python pickle format

• TCP

• Metrics go to disk

• whisper
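
A sketch of the two input formats as a client would build them. It assumes the usual carbon defaults: the text format is one "path value timestamp" line per point (port 2003), and the pickle format is a length-prefixed pickled list of (path, (timestamp, value)) tuples (port 2004). Hosts, ports and function names here are illustrative assumptions, so check your carbon.conf.

```python
import pickle
import socket
import struct

def plaintext_line(path, value, timestamp):
    """One point in the line-oriented text format."""
    return f"{path} {value} {int(timestamp)}\n".encode("ascii")

def pickle_message(points):
    """points: list of (path, (timestamp, value)); returns header + payload."""
    payload = pickle.dumps(points, protocol=2)
    header = struct.pack("!L", len(payload))  # 4-byte big-endian length prefix
    return header + payload

def send(message, host="localhost", port=2003):
    """Send bytes to carbon over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(message)
```

Usage would look like send(plaintext_line("server123.myapp.logins.failed", 1, 1367366400)) for a single point, or send(pickle_message(points), port=2004) for a batch.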


Write Buffer

• Most important feature is write buffering to protect the disk.

• Data is buffered and written out once per minute (or so).

• But

The Cache

• It's a write cache.

• Once data is written, it's out of the cache

• In other words, the cache is metrics not on disk.

• If the cache dies, you lose metrics

• (btw: the read cache is the os disk cache)

New Metrics

• New metrics are created automatically

• But, it is very expensive.

• MAX_CREATES_PER_MINUTE = 50

• Saves your disk, but new metrics will "pile up" in cache.

• May take 10m+ for your metrics to start flowing....
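
A back-of-the-envelope sketch of why new metrics can take ten minutes or more to show up. It assumes creates are the only bottleneck and that the limit is applied evenly; the function name is hypothetical.

```python
import math

def minutes_until_created(new_metrics, max_creates_per_minute=50):
    """Minutes needed to create a burst of new metrics at the configured rate."""
    return math.ceil(new_metrics / max_creates_per_minute)

# Deploying one new server that emits 500 previously unseen metrics:
print(minutes_until_created(500))  # 10
```

Until a metric's file exists, its points sit in the cache, which is the "pile up" described above.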

FALLOCATE

• WHISPER_FALLOCATE_CREATE = True

• Linux Kernel >= 2.6.23

• fallocate is used to preallocate blocks to a file. For filesystems which support the fallocate system call, this is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeros.

• https://bugs.launchpad.net/whisper/+bug/957827

Limit the Size

Limit the size of the cache to avoid swapping or becoming CPU bound. Sorts and serving cache queries gets more expensive as the cache grows. Use the value "inf" (infinity) for an unlimited cache size.

MAX_CACHE_SIZE = inf

No! Infinity does not exist on your system!

Graphite for Graphite

By default, carbon itself will log statistics (such as a count, metricsReceived) with the top level prefix of 'carbon' at an interval of 60 seconds. Set CARBON_METRIC_INTERVAL to 0 to disable instrumentation

CARBON_METRIC_PREFIX = carbon

CARBON_METRIC_INTERVAL = 60


Stats on

Stats!


Middleware

carbon-aggregator

Part 2

Pre-Aggregation

• Sum or Average metrics based on wildcards and regexps

• Helps eliminate very slow queries on webui

• You can emit the final sum & all the individual components or just the final sum (via blacklists)
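
To make the idea concrete, here is a sketch of regexp-driven summing in plain Python. It is not carbon-aggregator's rule syntax or implementation; the pattern, the output template and the metric names are invented for illustration.

```python
import re
from collections import defaultdict

def aggregate_sum(points, pattern, output_template):
    """Sum (name, value) points whose name fully matches pattern.

    output_template may reference regexp groups, e.g. r"all.\\1.logins".
    """
    rule = re.compile(pattern)
    totals = defaultdict(float)
    for name, value in points:
        match = rule.fullmatch(name)
        if match:
            totals[match.expand(output_template)] += value
    return dict(totals)

points = [
    ("server001.myapp.logins", 3),
    ("server002.myapp.logins", 4),
    ("server001.otherapp.logins", 9),
]
print(aggregate_sum(points, r"server\d+\.(\w+)\.logins", r"all.\1.logins"))
```

The web UI can then graph one pre-summed series instead of expanding a wildcard across every server at query time.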


destination

• r/w to localhost

• split metrics to other aggregators

• Design your own system


Along the way

• renaming of metrics

• whitelist and blacklist of aggregation and metrics


Also...

• has support for broadcasting data to multiple downstream caches

• but.. never used it.. and seems at odds with the next middleware


Middleware

carbon-relay

Part 3

It's a Router!

• Consistent Hashing (Sharding)

• Or more rule-based routing

• Output to multiple carbon servers

have not really used it much, but should work similarly to scale outs of memcache, redis
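
For readers who have not met consistent hashing, a generic sketch follows. It illustrates the technique only and is not carbon-relay's actual ring; the class name, the hash choice and the replica count are all assumptions.

```python
import bisect
import hashlib

class HashRing:
    """Generic consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [position for position, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, metric):
        """Return the first node clockwise from the metric's hash."""
        index = bisect.bisect(self._keys, self._hash(metric)) % len(self._ring)
        return self._ring[index][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("server123.myapp.logins.failed"))
```

The useful property is that removing a node only moves the metrics that lived on that node; everything else keeps its destination.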

Middleware

StatsD

Part 4

StatsD

• https://github.com/etsy/statsd/

• nodejs based but lots of other implementations

• Receives UDP, sends graphite-compatible output, flushed periodically.

• Aggregation for all by default

• Besides sum, it can also compute other basic statistics (mean, 90th percentile), do sampling, have counters, etc.

StatsD use case

• It's UDP based, so it excels at embedding a client inside the application

• UDP can't block or break the sending application

• Not so good for bulk metrics

• Use both! Can work together with aggregator.
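
A minimal embedded client might look like the sketch below. It assumes the common StatsD counter line format ("name:value|c", with an optional "|@rate" suffix for sampling) and the usual default port 8125; the host, port and function names are assumptions to adapt.

```python
import socket

def statsd_counter(name, count=1, sample_rate=1.0):
    """Format a counter increment in the StatsD line format."""
    line = f"{name}:{count}|c"
    if sample_rate < 1.0:
        line += f"|@{sample_rate}"
    return line.encode("ascii")

def send_udp(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget: no connection and no reply to wait for."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()

print(statsd_counter("myapp.logins.failed"))
```

Because nothing is acknowledged, a dead StatsD server costs the application dropped packets rather than stalled requests.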


Of Note

• https://github.com/armon/statsite

• Need to look at this more

• c + libev based

• modern time series algorithms

• very flexible output


Backups

Do you really need them? http://bit.ly/11sPhNz

Backup

• Doing a naive backup causes Graphite performance to go to crap.

• File system cache is trashed

• Metrics are not written to disk (lag)

• If OOM occurs then you lose metrics.


Do you need to save everything?


nice is good

http://www.beenthereyet.net/nice-france

If you are doing your own backup....

ionice is better

IONICE(1) User Commands IONICE(1)

NAME

ionice - set or get process I/O scheduling class and priority

SYNOPSIS

ionice [-c class] [-n level] [-t] -p PID...

ionice [-c class] [-n level] [-t] command [argument...]

DESCRIPTION

This program sets or gets the I/O scheduling class and priority for a program. If no arguments or just -p is given, ionice will query the current I/O scheduling class and priority for that process.

When command is given, ionice will run this command with the given arguments. If no class is specified, then command will be executed with the "best-effort" scheduling class. The default priority level is 4.

NOTES

Linux supports I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler.

util-linux July 2011 IONICE(1)
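
One way to apply this from a backup script is to wrap the backup command, as sketched below. It assumes class 3 is the "idle" I/O class (per the full ionice man page, not the excerpt above) and combines it with nice for CPU. The tar command and paths are placeholders, not a recommended backup method.

```python
import shlex

def low_priority_command(command, io_class=3, nice_level=19):
    """Prefix a command so it runs with idle I/O class and lowest CPU priority."""
    return ["ionice", "-c", str(io_class), "nice", "-n", str(nice_level)] + list(command)

# Placeholder backup command and paths:
backup = ["tar", "-czf", "/backup/whisper.tar.gz", "/opt/graphite/storage/whisper"]
cmd = low_priority_command(backup)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```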


Even Better

• Just write the metrics to two graphite servers in your client

• Script to copy / resync "holes" when restoring.
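
The "resync holes" step could start from something like this sketch, which works on already-fetched series rather than on whisper files. The data shape, a time-sorted list of (timestamp, value) pairs with None for missing points, and both function names are assumptions.

```python
def find_holes(points):
    """Return inclusive (start, end) timestamp ranges where the value is None."""
    holes = []
    start = previous = None
    for timestamp, value in points:
        if value is None:
            if start is None:
                start = timestamp
            previous = timestamp
        elif start is not None:
            holes.append((start, previous))
            start = None
    if start is not None:
        holes.append((start, previous))
    return holes

def fill_holes(primary, secondary):
    """Fill None values in primary with the secondary server's value, if any."""
    backup = dict(secondary)
    return [(ts, backup.get(ts) if value is None else value) for ts, value in primary]

primary = [(0, 1), (60, None), (120, None), (180, 2), (240, None)]
secondary = [(0, 1), (60, 5), (120, 6), (180, 2), (240, 7)]
print(find_holes(primary))
print(fill_holes(primary, secondary))
```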


Monitoring


WebUI

• Hey, it's a web server

• do all the usual stuff

• Ask for known stats,

• check for 200

• check for valid json output
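
A health check along these lines can be sketched with the standard library. It assumes the render endpoint with format=json; the base URL and the target you ask for are placeholders, and the function names are invented.

```python
import json
import urllib.error
import urllib.request
from urllib.parse import urlencode

def response_is_healthy(status, body):
    """True only for HTTP 200 with a body that parses as JSON."""
    if status != 200:
        return False
    try:
        json.loads(body)
    except ValueError:
        return False
    return True

def check_graphite(base_url, target):
    """Ask for a known stat and validate the answer (base_url is a placeholder)."""
    query = urlencode({"target": target, "from": "-5min", "format": "json"})
    try:
        with urllib.request.urlopen(f"{base_url}/render?{query}", timeout=10) as resp:
            return response_is_healthy(resp.status, resp.read())
    except (urllib.error.URLError, OSError):
        return False  # connection failures and 4xx/5xx responses land here
```

A call such as check_graphite("http://graphite.example.com", "carbon.agents.*.metricsReceived") can then feed whatever alerting you already run.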


Mistakes in URL or use of functions cause Server 500


Old Stats

• Don't forget to kill off old metrics

• no updates in X days? kill.

• Exercise in "find" left to reader
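
For those who would rather not do the exercise in "find", here is a Python sketch of the same idea. It assumes a file's modification time reflects the last update to the metric and that whisper files end in .wsp; it only lists candidates and deletes nothing.

```python
import os
import time

def stale_whisper_files(root, max_age_days, now=None):
    """Yield .wsp files under root not modified in the last max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(".wsp"):
                path = os.path.join(dirpath, filename)
                if os.path.getmtime(path) < cutoff:
                    yield path
```

Review the list before removing anything, for example with for path in stale_whisper_files(root, 30): print(path).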


MySQL

• If you use SQLite3 -- uhh nothing to monitor

• If you use MySQL -- use the regular suspects

• And don't forget to back up!!

CPU is stable

• ... except for apache usage

• consider moving apache to separate machine

Disk is Sensitive

Competing with other processes for disk does this


Means less written to disk

Metrics updated


Dangerous build up in cache

Cache Size

Rendering

Tune Apache

• By default, your Apache install is likely to be "unlimited" in CPU and Memory usage.

• Selecting a wildcard metric over a long time period can easily turn an httpd process into 1GB. (this seems like a bug actually)

• OOM death.


/version/

• Yes, ending "/" is required.

• Ok not that exciting but easy check


/metrics/expand/

• /metrics/expand/?query=server*

• {"results": ["server001", "server002", ... ]}

/events/

• Ad-Hoc Events that don't deserve their own metric type.

• has tags, time, and text

• Stored in SQLite3 by default by the webapp.

• Rest UI is primitive


The WebUI

• it's "ok".. good for experiments

• You will want to make your own dashboard.

• Good news! The API is a URL, so it's very easy.


WebUI Dashboards

• The WebUI has a dashboard feature for loading and saving graphs

• It saves data in SQLite3 by default

• Since it's there people will use it

• So hack to remove it, or switch to MySQL.

Granularity

• Like RRDTool, the resolution of the graph depends on number of pixels used. No sub-pixel rendering!

• Rapid spikes can be "averaged away" in week-long views in small graphs.
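
The "averaged away" effect is easy to reproduce. The sketch below squeezes one hour of 1-minute points into six pixel columns, first by averaging and then by taking the maximum; the data is invented and the helper is not Graphite's own consolidation code.

```python
def consolidate(values, buckets, func):
    """Squeeze values into at most `buckets` points, one per pixel column."""
    size = -(-len(values) // buckets)  # ceiling division
    return [func(values[i:i + size]) for i in range(0, len(values), size)]

def average(chunk):
    return sum(chunk) / len(chunk)

series = [1] * 59 + [1000]  # a flat hour with a single one-minute spike
print(consolidate(series, 6, average)[-1])  # spike diluted by its neighbours
print(consolidate(series, 6, max)[-1])      # spike preserved
```

With averaging, the spike shrinks by roughly a factor of ten here, and it shrinks further as the view gets longer or the graph narrower. Graphite's consolidateBy() render function may be worth a look when the peaks matter.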


Vertical Line Technology

• Easy to make horizontal lines

• Not so clear how to make ad-hoc vertical lines


Turn "time since" into events

drawAsInfinite( removeAboveValue( keepLastValue( YOURMETRIC ), 120 ))

Turn Version Numbers into Events

drawAsInfinite( removeBelowValue( derivative( keepLastValue( YOURMETRIC )), 0.1 ))

Arbitrary Lines

drawAsInfinite( removeBelowValue( removeAboveValue( time("time"), timestamp), timestamp))
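
Since the API is just a URL, these expressions can be generated rather than typed. The sketch below builds a render URL for the "time since" trick; the base URL and metric name are placeholders, and the helper names are made up.

```python
from urllib.parse import urlencode

def as_events(metric, max_age=120):
    """Wrap a 'seconds since X' metric so recent occurrences draw as vertical lines."""
    return f"drawAsInfinite(removeAboveValue(keepLastValue({metric}),{max_age}))"

def render_url(base_url, targets, params=None):
    """Build a /render URL with one target parameter per series."""
    fields = [("target", t) for t in targets] + list((params or {}).items())
    return f"{base_url}/render?{urlencode(fields)}"

url = render_url(
    "http://graphite.example.com",            # placeholder host
    ["myapp.requests", as_events("myapp.seconds_since_deploy")],
    {"from": "-6h"},
)
print(url)
```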

Really Long URLs

• Making a graph, but the URL is so long that browsers clip it?

• Send query string data as a POST
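
A sketch of the POST variant using only the standard library. It assumes the render endpoint accepts the same fields form-encoded in the request body, as the slide suggests; the host and function name are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

def render_post_request(base_url, targets, params=None):
    """Build a POST to /render carrying the query string data in the body."""
    fields = [("target", t) for t in targets] + list((params or {}).items())
    body = urlencode(fields).encode("ascii")
    return Request(f"{base_url}/render", data=body)  # data present => POST

req = render_post_request(
    "http://graphite.example.com",  # placeholder host
    [f"server{i:03d}.myapp.logins.failed" for i in range(200)],
    {"from": "-1d", "format": "json"},
)
print(req.get_method(), len(req.data), "bytes in body")
# To send it: urllib.request.urlopen(req, timeout=30)
```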


Client Side Rendering

• yeah...

• works ok with a small number of points

• crashes existing browsers with large number of points

• Server side faster in many cases!

• We'll try again in 2014


Colors and ChartJunk

• Default color scheme is gross

• Be kind to the handicapped (uhh, me) http://colorbrewer2.org/

• Good overview here: http://bit.ly/10Hu7zU

Looking for something to do?


Accelerate with PyPy

• JIT for Python

• ~ 5.9x performance improvement

• Actually works and is stable

• Compatible with twisted and Django


Accelerate with numpy

• numpy provides fast vector manipulation (C code)

• graphite web gui does a lot of vector manipulation

• hmmmm.....


Ceres Storage Engine

• "Eventually Fixed Size" storage

• More space efficient == more performance

• see http://blog.sweetiq.com/2013/01/using-ceres-as-the-back-end-database-to-graphite/

OpenTSDB

• Not Graphite, but similar in spirit

• Has "collectors" for basic ops stats

• Used by StumbleUpon, Box.net, Pinterest

• Good: Stores data in HBASE/Hadoop

• Bad: Stores data in HBASE/Hadoop

Add More Functions

• coarsen (I'm looking at you Ian Malpass, that's useful for client-side rendering)

• Real vertical lines (our hacks are stupid)

• Better operators (it would be nice to know easily how many metrics you have, e.g. select count(*))

Mine the Apache Log

• Which stats are used the most?

• What are really slow queries?

• Can you optimize them?

• What time frames are used?

• How much old data do you really need to store?

it's in the

query string
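
A starting point for the log mining, as a sketch: it assumes Apache's common or combined log format and counts render targets and "from" windows out of the query string. Finding slow queries would additionally need request timing in your LogFormat, which is not assumed here.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the request part of a common/combined format line (assumed format).
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')

def count_render_usage(log_lines):
    """Return (targets, windows) Counters for /render requests."""
    targets, windows = Counter(), Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if not url.path.startswith("/render"):
            continue
        query = parse_qs(url.query)
        targets.update(query.get("target", []))
        windows.update(query.get("from", ["(default)"]))
    return targets, windows

sample = [
    '10.0.0.1 - - [01/May/2013:10:00:00 +0000] "GET /render?target=a.b&from=-1h HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/May/2013:10:00:05 +0000] "GET /render?target=a.b&target=sumSeries(c.*)&from=-7d HTTP/1.1" 200 900',
    '10.0.0.3 - - [01/May/2013:10:00:09 +0000] "GET /version/ HTTP/1.1" 200 10',
]
targets, windows = count_render_usage(sample)
print(targets.most_common(5))
print(windows.most_common(5))
```

The window counts give a first answer to how much old data people actually look at.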


Add a TinyURL Feature

• The URLs get really long and are hard to put into email, etc.

• Build a tinyurl feature into the Django app and integrate it into the dashboard.

Write Docs

yeah you!

Nick Galbreath

http://www.client9.com/

nickg@client9.com

http://www.iponweb.com/

ngalbreath@iponweb.net

Let's Make

Some Graphs!
