care and feeding of large scale graphite installations - devopsdays austin 2013

Nick Galbreath http://client9.com/20130501 @ngalbreath Care and Feeding of Large Scale Graphite Installations Nick Galbreath IPONWEB DevOpsDays Austin Texas 2013-04-30

Upload: nick-galbreath

Post on 15-Jan-2015

DESCRIPTION

Tips and tricks for using and scaling Graphite. First presented at DevOpsDays Austin Texas 2013-05-01

TRANSCRIPT

Page 1: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Nick Galbreath http://client9.com/20130501 @ngalbreath

Care and Feeding of Large Scale Graphite Installations

Nick Galbreath ★ IPONWEB
DevOpsDays ★ Austin Texas ★ 2013-04-30

Page 2: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

http://client9.com/20130501

Page 3: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Who is nickg?

Page 4: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

(that's online advertising infrastructure)

Page 5: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

• Over One Billion Points collected daily.

• In "many" independent Graphite clusters.

Page 6: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

so, graphite?

Page 7: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Who cares?

• Making it easy to create, analyze and share data can change your organization

• Making a data-driven culture

• Empowering developers, operations, qa, security and business to be more confident in the changes they make.

Page 8: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

What is it you say you do here?

• Your job is likely invisible to the rest of the organization

• invisible things aren't valued

• so make what you do visible

Page 9: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Why Graphite?

• Many innovations in each part of the stack

• But, it's the Full Stack that really makes it special.

• On-disk layout to UI to API to... the community around it.

Page 10: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Sharing is Caring

• Allows data to be easily accessed

• And easily shared. This makes it different than many monitoring solutions.

• It's your own in-house mashup generator.

Page 11: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

What is it?

Page 12: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

The Big Picture

• It's a database

• But not ACID

• And without all the database tools

Page 13: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Installation

• 3 python-twistd servers

• "carbon-cache"

• "carbon-aggregator"

• "carbon-relay"

• Apache / Django Web UI and API

• Uses SQLite3/MySQL for dashboards / events

Page 14: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Be Current

• Don't use the OS default version.

• Newer point releases of graphite have significant improvements in storage engine and webui/api

• It's 100% Python so "building it yourself" shouldn't be too hard.

• pip install works and is current.

Page 15: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

The Documentation

• everyone complains about it

• historically bad, but getting a lot better

• Switched locations, but not all search engines are updated to use: http://graphite.readthedocs.org/

• Source code is quite good, so RTFS

Page 16: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Storage Engine

Page 17: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

What is it?

• Storage engine

• Handles reads, writes and creates of a single metric to a fixed size file.

• One file, kinda dumb (good).

• Here's the API: https://github.com/graphite-project/whisper/blob/master/whisper.py

Page 18: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Graphite Math

• About 12 bytes per point.

• Store 1 minute points for 1 month and 15 minutes for 11 months.

• (60×24×30 + 4×24×30×11) ×12 = 878kB

• If you can keep all your points in memory, then magic!
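The arithmetic above can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes a 30-day month and exactly 12 bytes per point, and it ignores whisper's small file header. The names `points` and `archives` are made up for illustration.

```python
# Back-of-envelope storage math from the slide: about 12 bytes per stored point.
# Whisper's file header is ignored here (assumption).
BYTES_PER_POINT = 12
DAY = 24 * 60 * 60  # seconds in a day


def points(seconds_per_point, retention_seconds):
    """Number of points an archive holds at the given resolution."""
    return retention_seconds // seconds_per_point


archives = [
    (60, 30 * DAY),            # 1-minute points for 1 month (30 days)
    (15 * 60, 11 * 30 * DAY),  # 15-minute points for 11 months
]

total_points = sum(points(step, keep) for step, keep in archives)
total_bytes = total_points * BYTES_PER_POINT

print(total_points, "points,", total_bytes, "bytes,",
      total_bytes / 1024, "KiB per metric")
```

Multiply the per-metric figure by the number of metrics to see whether the whole data set fits in the OS disk cache.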

Page 19: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Disk Layout

• Each metric creates a directory tree: server123.myapp.logins.failed

• Makes 3 directories

• This creates a very branchy directory structure

• This has good and bad points.
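A minimal sketch of how a dotted metric name maps onto that directory tree. The storage root used here is an assumption (a common default install location), and `metric_to_path` is a hypothetical helper, not part of whisper or carbon.

```python
import posixpath


def metric_to_path(metric, root="/opt/graphite/storage/whisper"):
    """Map a dotted metric name to a .wsp file path (illustrative only).

    Every dot-separated component except the last becomes a directory;
    the last component becomes the file name.
    """
    parts = metric.split(".")
    return posixpath.join(root, *parts) + ".wsp"


path = metric_to_path("server123.myapp.logins.failed")
print(path)
```

With four name components you get three directories and one file, which is why a large installation ends up with a very branchy tree.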

Page 20: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Middleware

carbon-cache

Part 1

Page 21: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Carbon Cache

• Metrics go in

• csv file or python pickle format

• TCP

• Metrics go to disk

• whisper
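A rough sketch of the two input formats, assuming carbon's usual defaults: a text line of the form `path value timestamp` on TCP port 2003, and a length-prefixed pickle of `(path, (timestamp, value))` tuples on port 2004. The helper names and the `localhost` default are placeholders; check the ports against your own carbon.conf.

```python
import pickle
import socket
import struct
import time


def plaintext_line(path, value, timestamp=None):
    """One metric in the line format: 'path value timestamp\\n'."""
    timestamp = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {timestamp}\n".encode("ascii")


def pickle_message(points):
    """Batch of metrics in the pickle format.

    points is a list of (path, (timestamp, value)) tuples. The payload is
    prefixed with its length as a 4-byte big-endian unsigned integer.
    """
    payload = pickle.dumps(points, protocol=2)
    return struct.pack("!L", len(payload)) + payload


def send(data, host="localhost", port=2003):
    """Send already-formatted bytes to carbon over TCP (host/port assumed)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(data)
```

Usage would look like `send(plaintext_line("server123.myapp.logins.failed", 3))` for single points, or `send(pickle_message(batch), port=2004)` for bulk loads.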

Page 22: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Write Buffer

• Most important feature is write buffering to protect the disk.

• Data is buffered and written out once per minute (or so).

• But

Page 23: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

The Cache

• It's a write cache.

• Once data is written, it's out of the cache

• In other words, the cache is metrics not on disk.

• If the cache dies, you lose metrics

• (btw: the read cache is the os disk cache)

Page 24: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

New Metrics

• New metrics are created automatically

• But, it is very expensive.

• MAX_CREATES_PER_MINUTE = 50

• Saves your disk, but new metrics will "pile up" in cache.

• May take 10m+ for your metrics to start flowing....
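To see where "10m+" comes from, here is a tiny estimate. It assumes carbon creates exactly the configured cap every minute and that nothing else is already queued, which is optimistic; `minutes_until_created` is a made-up helper.

```python
import math


def minutes_until_created(new_metrics, max_creates_per_minute=50):
    """Lower bound on minutes before the last new metric gets its file."""
    return math.ceil(new_metrics / max_creates_per_minute)


# Deploying a host that introduces 500 never-seen metrics:
print(minutes_until_created(500), "minutes")
```

Until its file exists, a new metric lives only in the cache, so a big deploy is also a burst of memory pressure.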

Page 25: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

FALLOCATE

• WHISPER_FALLOCATE_CREATE = True

• Linux Kernel >= 2.6.23

• fallocate is used to preallocate blocks to a file. For filesystems which support the fallocate system call, this is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeros.

• https://bugs.launchpad.net/whisper/+bug/957827

Page 26: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Limit the Size

Limit the size of the cache to avoid swapping or becoming CPU bound.
Sorts and serving cache queries gets more expensive as the cache grows.
Use the value "inf" (infinity) for an unlimited cache size.

MAX_CACHE_SIZE = inf

No! Infinity does not exist on your system!

Page 27: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Graphite for Graphite

By default, carbon itself will log statistics (such as a count, metricsReceived) with the top level prefix of 'carbon' at an interval of 60 seconds. Set CARBON_METRIC_INTERVAL to 0 to disable instrumentation

CARBON_METRIC_PREFIX = carbon
CARBON_METRIC_INTERVAL = 60

Page 28: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Stats on

Stats!

Page 29: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Middleware

carbon-aggregator

Part 2

Page 30: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Pre-Aggregation

• Sum or Average metrics based on wildcards and regexps

• Helps eliminate very slow queries on webui

• You can emit the final sum & all the individual components or just the final sum (via blacklists)
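The idea of pre-aggregation, sketched in plain Python. This is not carbon-aggregator's rule syntax or its implementation; it only illustrates collapsing everything that matches a wildcard into one output metric. `aggregate` and the metric names are hypothetical.

```python
from fnmatch import fnmatch


def aggregate(points, pattern, output, method="sum"):
    """Collapse all metrics matching a wildcard into one output metric.

    points maps metric name -> value for a single time interval.
    Returns {output: aggregated_value}, or {} if nothing matched.
    """
    values = [v for name, v in points.items() if fnmatch(name, pattern)]
    if not values:
        return {}
    total = sum(values)
    return {output: total if method == "sum" else total / len(values)}


interval = {
    "server1.app.requests": 10,
    "server2.app.requests": 30,
    "server1.app.errors": 1,
}
print(aggregate(interval, "server*.app.requests", "all.app.requests"))
print(aggregate(interval, "server*.app.requests", "all.app.requests", "avg"))
```

Doing this once on the way in means the web UI reads one file instead of expanding a wildcard over every server at query time.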

Page 31: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

destination

• r/w to localhost

• split metrics to other aggregators

• Design your own system

Page 32: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Along the way

• renaming of metrics

• whitelist and blacklist of aggregation and metrics

Page 33: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Also...

• has support for broadcasting data to multiple downstream caches

• but.. never used it.. and seems at odds with the next middleware

Page 34: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Middleware

carbon-relay

Part 3

Page 35: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

It's a Router!

• Consistent Hashing (Sharding)

• Or more rule-based routing

• Output to multiple carbon servers

have not really used it much, but should work similarly to scale outs of memcache, redis
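For readers who have not met consistent hashing, here is a generic sketch of the technique. It is not carbon-relay's actual ring (the hash function, replica count and node naming here are all assumptions); it only shows why adding or removing a node moves a fraction of the metrics rather than all of them.

```python
import bisect
import hashlib


class HashRing:
    """Generic consistent-hash ring (illustrative, not carbon's own)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring many times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, metric):
        """Return the node owning the first ring position after the metric."""
        idx = bisect.bisect(self._keys, self._hash(metric)) % len(self._keys)
        return self._ring[idx][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.get_node("server123.myapp.logins.failed"))
```

A given metric name always lands on the same node, so each whisper file is written by exactly one carbon-cache.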

Page 36: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Middleware

StatsD

Part 4

Page 37: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

StatsD

• https://github.com/etsy/statsd/

• nodejs based but lots of other implementations

• Receives UDP, sends graphite-compatible output, flushed periodically.

• Aggregation for all by default

• Besides sum, it can also compute other basic statistics (mean, 90th percentile), do sampling, have counters, etc.

Page 38: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

StatsD use case

• It's UDP based, so it excels at embedding a client inside the application

• UDP can't block or break the sending application

• Not so good for bulk metrics

• Use both! Can work together with aggregator.
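A minimal embedded client, as a sketch. It assumes the common StatsD wire format (`name:value|c` for counters, `name:value|ms` for timers) and the usual UDP port 8125; the function names and the `localhost` default are placeholders.

```python
import socket


def statsd_counter(name, value=1, sample_rate=1.0):
    """Format a counter increment, optionally with a sample rate."""
    msg = f"{name}:{value}|c"
    if sample_rate < 1.0:
        msg += f"|@{sample_rate}"
    return msg.encode("ascii")


def statsd_timer(name, milliseconds):
    """Format a timing measurement in milliseconds."""
    return f"{name}:{milliseconds}|ms".encode("ascii")


def send_udp(payload, host="localhost", port=8125):
    """Fire and forget: never let metrics break the application."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    except OSError:
        pass  # Dropping a metric is preferable to raising in app code.
    finally:
        sock.close()
```

Calling `send_udp(statsd_counter("myapp.logins.failed"))` from a request handler costs one datagram and cannot block on a slow metrics server.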

Page 39: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Of Note

• https://github.com/armon/statsite

• Need to look at this more

• c + libev based

• modern time series algorithms

• very flexible output

Page 40: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Backups

Do you really need them? http://bit.ly/11sPhNz

Page 41: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Backup

• Doing a naive backup makes graphite performance go to crap.

• File system cache is trashed

• Metrics are not written to disk (lag)

• If OOM occurs then you lose metrics.

Page 42: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Do you need to save everything?

Page 43: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

nice is good

http://www.beenthereyet.net/nice-france

If you are doing your own backup....

Page 44: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

ionice is better

IONICE(1) User Commands IONICE(1)

NAME

ionice - set or get process I/O scheduling class and priority

SYNOPSIS

ionice [-c class] [-n level] [-t] -p PID...
ionice [-c class] [-n level] [-t] command [argument...]

DESCRIPTION

This program sets or gets the I/O scheduling class and priority for a program. If no arguments or just -p is given, ionice will query the current I/O scheduling class and priority for that process.

When command is given, ionice will run this command with the given arguments. If no class is specified, then command will be executed with the "best-effort" scheduling class. The default priority level is 4.

NOTES

Linux supports I/O scheduling priorities and classes since 2.6.13 with the CFQ I/O scheduler.

util-linux July 2011 IONICE(1)

Page 45: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Even Better

• Just write the metrics to two graphite servers in your client

• Script to copy / resync "holes" when restoring.

Page 46: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Monitoring

Page 47: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

WebUI

• Hey, it's a web server

• do all the usual stuff

• Ask for known stats,

• check for 200

• check for valid json output
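The checks above, sketched as a small probe. It assumes the render API's `format=json` output is a JSON list of series; the base URL and target are placeholders you would replace with a metric you know always exists (carbon's own stats are a natural choice).

```python
import json
import urllib.parse
import urllib.request


def render_url(base, target, frm="-5min"):
    """Build a render API URL asking for JSON output."""
    query = urllib.parse.urlencode(
        {"target": target, "from": frm, "format": "json"}
    )
    return f"{base.rstrip('/')}/render?{query}"


def looks_healthy(status, body):
    """200, parseable JSON, and at least one series returned."""
    if status != 200:
        return False
    try:
        data = json.loads(body)
    except ValueError:
        return False
    return isinstance(data, list) and len(data) > 0


def check(base, target):
    """Run the probe; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(render_url(base, target), timeout=10) as resp:
            return looks_healthy(resp.status, resp.read().decode("utf-8"))
    except OSError:
        return False
```

Wire `check("http://graphite.example.com", "carbon.agents.*.metricsReceived")` into whatever alerting you already run; the hostname here is hypothetical.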

Page 48: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Mistakes in URL or use of functions cause Server 500

Page 49: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Old Stats

• Don't forget to kill off old metrics

• no updates in X days? kill.

• Exercise in "find" left to reader
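If you would rather not do the exercise in "find", the same idea in Python might look like this. It assumes whisper files end in `.wsp` and that file modification time is a fair proxy for "last update"; `stale_metrics` is a made-up helper, and it only lists candidates rather than deleting anything.

```python
import os
import time


def stale_metrics(root, max_age_days, now=None):
    """Yield .wsp files under root not modified in max_age_days days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.endswith(".wsp"):
                continue
            path = os.path.join(dirpath, name)
            if os.path.getmtime(path) < cutoff:
                yield path
```

Review the list before removing files, and remember to clean up the empty directories left behind.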

Page 50: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

MySQL

• If you use SQLite3 -- uhh nothing to monitor

• If you use MySQL -- use the regular suspects

• And don't forget to backup!!

Page 51: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

CPU is stable

• ... except for apache usage

• consider moving apache to separate machine

Page 52: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Disk is Sensitive

Competing with other processes for disk does this

Page 53: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Means less written to disk

Metrics updated

Page 54: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Dangerous build up in cache

Cache Size

Page 55: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Rendering

Page 56: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Tune Apache

• By default, your Apache install is likely to be "unlimited" in CPU and Memory usage.

• Selecting a wildcard metric over a long time period can easily turn an httpd process into 1GB. (this seems like a bug actually)

• OOM death.

Page 57: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

/version/

• Yes, ending "/" is required.

• Ok not that exciting but easy check

Page 58: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

/metrics/expand/

• /metrics/expand/?query=server*

• {"results": ["server001", "server002", ... ]}

Page 59: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

/events/

• Ad-Hoc Events that don't deserve their own metric type.

• has tags, time, and text

• Stored in SQLite3 by default by the webapp.

• Rest UI is primitive

Page 60: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

The WebUI

• it's "ok".. good for experiments

• You will want to make your own dashboard.

• Good news! The API is a URL, so it's very easy.

Page 61: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

WebUI Dashboards

• The WebUI has a dashboard feature for loading and saving graphs

• It saves data in SQLite3 by default

• Since it's there people will use it

• So hack to remove it, or switch to MySQL.

Page 62: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Granularity

• Like RRDTool, the resolution of the graph depends on number of pixels used. No sub-pixel rendering!

• Rapid spikes can be "averaged away" in week-long views in small graphs.
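A toy illustration of how a spike gets "averaged away" when many points share one pixel. This is a simplified model of consolidation, not graphite-web's rendering code; `consolidate` is a hypothetical helper.

```python
def consolidate(values, points_per_pixel, method="average"):
    """Squash consecutive points into one value per pixel column."""
    out = []
    for i in range(0, len(values), points_per_pixel):
        bucket = [v for v in values[i:i + points_per_pixel] if v is not None]
        if not bucket:
            out.append(None)
        elif method == "max":
            out.append(max(bucket))
        else:
            out.append(sum(bucket) / len(bucket))
    return out


# Nine quiet minutes and a one-minute spike to 100, drawn in a single pixel:
series = [0] * 9 + [100]
print(consolidate(series, 10))          # the spike shrinks to a tenth
print(consolidate(series, 10, "max"))   # taking the max keeps it visible
```

If spikes matter more than averages for a given graph, pick a max-style consolidation or draw the graph wider.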

Page 63: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Vertical Line Technology

• Easy to make horizontal lines

• Not so clear how to make ad-hoc vertical lines

Page 64: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Turn "time since" into events

drawAsInfinite( removeAboveValue( keepLastValue( YOURMETRIC ), 120 ))

Page 65: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Page 66: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Turn Version Numbers into Events

drawAsInfinite( removeBelowValue( derivative( keepLastValue( YOURMETRIC )), 0.1 ))

Page 67: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Page 68: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Arbitrary Lines

drawAsInfinite( removeBelowValue( removeAboveValue( time("time"), timestamp), timestamp))

Page 69: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Really Long URLs

• Making a graph, but the URL is so long that browsers are clipping it?

• Send query string data as a POST
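A sketch of the POST approach using only the standard library. The base URL is a placeholder, and `render_request` is a hypothetical helper; the point is simply that the same `target=...` pairs go in the request body instead of the URL.

```python
import urllib.parse
import urllib.request


def render_request(base, targets, frm="-1d"):
    """Build a POST request carrying the render parameters in the body."""
    params = [("target", t) for t in targets]
    params += [("from", frm), ("format", "json")]
    body = urllib.parse.urlencode(params).encode("ascii")
    # Supplying data makes urllib issue a POST instead of a GET.
    return urllib.request.Request(f"{base.rstrip('/')}/render", data=body)


req = render_request(
    "http://graphite.example.com",
    ["drawAsInfinite(removeAboveValue(keepLastValue(YOURMETRIC),120))"],
)
print(req.get_method(), req.full_url, len(req.data), "bytes in body")
```

Pass the request to `urllib.request.urlopen(req)` to actually fetch the data.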

Page 70: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Client Side Rendering

• yeah...

• works ok with a small number of points

• crashes existing browsers with large number of points

• Server side faster in many cases!

• We'll try again in 2014

Page 71: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Colors and ChartJunk

• Default color scheme is gross

• Be kind to the handicapped (uhh, me) http://colorbrewer2.org/

• Good overview here: http://bit.ly/10Hu7zU

Page 72: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Looking for something to do?

Page 73: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Accelerate with PyPy

• JIT for Python

• ~ 5.9x performance improvement

• Actually works and is stable

• Compatible with twisted and Django

Page 74: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Accelerate with numpy

• numpy provides fast vector manipulation (C code)

• graphite web gui does a lot of vector manipulation

• hmmmm.....

Page 75: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Ceres Storage Engine

• "Eventually Fixed Size" storage

• More space efficient == more performance

• see http://blog.sweetiq.com/2013/01/using-ceres-as-the-back-end-database-to-graphite/

Page 76: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

OpenTSDB

• Not Graphite, but similar in spirit

• Has "collectors" for basic ops stats

• Used by StumbleUpon, Box.net, Pinterest

• Good: Stores data in HBASE/Hadoop

• Bad: Stores data in HBASE/Hadoop

Page 77: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Add More Functions

• coarsen (I'm looking at you Ian Malpass, that's useful for client-side rendering)

• Real vertical lines (our hacks are stupid)

• Better operators (would be nice to know easily how many metrics you have, e.g. select count(*))

Page 78: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Mine the Apache Log

• Which stats are used the most?

• What are really slow queries?

• Can you optimize them?

• What time frames are used?

• How much old data do you really need to store?

it's in the

query string
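A starting point for the log mining, assuming Apache's common or combined log format where the request line appears in double quotes. The regular expression and `count_targets` are illustrative and will need adjusting to your own LogFormat; only GET requests carry the parameters in the log, so POSTed queries will not show up.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches the quoted request line, e.g. "GET /render?target=a.b HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST) (\S+) HTTP/[\d.]+"')


def count_targets(lines):
    """Count which targets and 'from' windows appear in render requests."""
    targets, windows = Counter(), Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if not url.path.startswith("/render"):
            continue
        query = parse_qs(url.query)
        targets.update(query.get("target", []))
        windows.update(query.get("from", []))
    return targets, windows
```

Feed it an open log file, then look at `targets.most_common(20)` and `windows.most_common()` to see what people actually graph and how far back they look.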

Page 79: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Add a TinyURL Feature

• The URLs get really long and are hard to put into email, etc.

• Build a tinyurl feature into the django app and integrate it into the dashboard.

Page 80: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Write Docs

yeah you!

Page 81: Care and Feeding of Large Scale Graphite Installations - DevOpsDays Austin 2013

Nick Galbreath
http://www.client9.com/
[email protected]

http://www.iponweb.com/
[email protected]

Let's Make Some Graphs!