Monitoring at a SaaS Startup: Tradeoffs and Tools
DESCRIPTION
I gave this talk at MinneBar 2014: http://sessions.minnestar.org/sessions/162

When I joined a SaaS startup already in progress as their first ops hire, what monitoring existed was a twisty maze of half-measures. The devteam dreaded oncall, and our Mean Time To Lost Sleep was way too low. Improving visibility into our infrastructure and application performance required trying new tools and changing how we thought about what we were measuring.

Join me for a tragicomic journey from the vale of blissful ignorance through the straits of Nagios and into the mountains of Graphite. We'll talk tools and pitfalls, missteps and dead ends, and everything we haven't yet done but should. Tools covered will include Nagios, StatsD, Graphite, and Sentry, with some digressions into others such as New Relic and MMS.

TRANSCRIPT
![Page 1: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/1.jpg)
Monitoring at a SaaS Startup
Tradeoffs and Tools
Bridget Kromhout
![Page 2: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/2.jpg)
8thbridge.com
small social commerce startup
acquired in the last week by Fluid, Inc.
small devteam
I am the ops team
![Page 3: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/3.jpg)
twisty maze of little shell scripts
bespoke artisanal monitoring
difficult to modify; doesn’t scale
http://www.pcgameshardware.de/screenshots/1280x1024/2007/07/CA01.jpg
![Page 4: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/4.jpg)
New Relic
pros:
nice graphs
application-level view
good error analysis
cons:
slow to update
many false-positive alerts
high prices (better now)
![Page 5: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/5.jpg)
Motivating Change
http://99designs.com/illustrations/contests/illustration-pagerduty-161025/entries
![Page 6: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/6.jpg)
Nagios: as hideous as you remember
![Page 7: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/7.jpg)
https://laur.ie/blog/2014/02/why-ill-be-letting-nagios-live-on-a-bit-longer-thank-you-very-much/
“Horrendous interface”
“Well, it’s more “old” than anything else. At least everything is in the same place as you left it because it’s been the same since 1912.”
![Page 8: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/8.jpg)
“Sensu has so many moving parts that I wouldn’t be able to sleep at night unless I set up a Nagios instance to make sure they were all running.”
-- @murphy_slaw (via @lozzd)
![Page 9: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/9.jpg)
HBase: monitor all the ports?!?
hbck: the HBase consistency checker
nagios -> bash script -> parsing output of hbck
http://www.ymc.ch/en/how-to-monitor-hbase-health-by-nagios
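The nagios -> script -> hbck chain above could look roughly like this in Python instead of bash. This is a minimal sketch, not the script from the linked post: the function names are made up for illustration, and it assumes hbck prints a summary line of the form "N inconsistencies detected." near the end of its output. Check that against your HBase version before trusting it at 2am.

```python
import re
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_hbck(output):
    """Map hbck text output to a (nagios_exit_code, message) pair.

    Assumption: the output contains a line like "0 inconsistencies detected."
    """
    match = re.search(r"^(\d+) inconsistencies detected", output, re.MULTILINE)
    if match is None:
        return UNKNOWN, "UNKNOWN: no inconsistency count found in hbck output"
    count = int(match.group(1))
    if count == 0:
        return OK, "OK: hbck reports 0 inconsistencies"
    return CRITICAL, "CRITICAL: hbck reports %d inconsistencies" % count


def main():
    # Hypothetical entry point: run hbck, print one status line, exit with the code.
    result = subprocess.run(["hbase", "hbck"], capture_output=True, text=True)
    code, message = parse_hbck(result.stdout)
    print(message)
    sys.exit(code)
```

Nagios only looks at the exit code and the first line of output, so keeping the parsing in one small function makes it easy to test against saved hbck output without a cluster.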
![Page 10: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/10.jpg)
adding alert after alert after...
![Page 11: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/11.jpg)
http://modiinhub.com/wp-content/uploads/2014/02/logo-mongodb-tagline.png
![Page 12: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/12.jpg)
![Page 13: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/13.jpg)
MMS (MongoDB Monitoring Service)
![Page 14: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/14.jpg)
“cyber” monday: 1988 called; wants its word back.
the rewards of hubris
MMS showed the issue, but we weren't alerting on it
didn't understand the global write lock
![Page 15: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/15.jpg)
If it moves, we track it. Sometimes we’ll draw a graph of something that isn’t moving yet, just in case it decides to make a run for it. -- @indec
http://codeascraft.com/2011/02/15/measure-anything-measure-everything/
![Page 16: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/16.jpg)
Graphite & StatsD
➔ Graphite
  ◆ Store and visualize time-series data
  ◆ http://graphite.readthedocs.org/
➔ StatsD
  ◆ Measure everything! (Timers, counters, events, …)
  ◆ https://github.com/etsy/statsd/
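For a sense of how little it takes to "measure everything": StatsD accepts plain-text datagrams over UDP, with counters written as `name:value|c` and timers as `name:value|ms`. Here is a minimal sketch of a client. The metric names are hypothetical, and the host and port are assumptions (8125 is the usual StatsD default).

```python
import socket


def counter(name, value=1):
    """Format a StatsD counter increment, e.g. 'logins:1|c'."""
    return "%s:%d|c" % (name, value)


def timer(name, millis):
    """Format a StatsD timer sample in milliseconds, e.g. 'render:320|ms'."""
    return "%s:%d|ms" % (name, millis)


def send(payload, host="127.0.0.1", port=8125):
    """Fire-and-forget one metric over UDP (host and port are assumptions)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload.encode("ascii"), (host, port))
    finally:
        sock.close()
```

Usage would be something like `send(counter("checkout.completed"))` or `send(timer("checkout.render", 320))`. Because it is UDP, the application does not block or fail if StatsD is down, which is also why silent data loss is possible.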
![Page 17: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/17.jpg)
Where we were
➔ Graphite 0.9.9 (wanted 0.9.12)
  ◆ over 2 years old
  ◆ missing new features (Consolidate by!)
➔ StatsD was newish, but…
  ◆ hand-rolled
  ◆ running in a screen session
  ◆ on a special snowflake box
![Page 18: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/18.jpg)
Community cookbooks?
➔ Graphite ones good, but…
  ◆ focus on Apache (we use nginx)
  ◆ we haven’t moved to Chef 11 (gasp!)
➔ StatsD
  ◆ https://github.com/librato/statsd-cookbook
  ◆ launches daemons via upstart
  ◆ generates config file based on attributes
![Page 19: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/19.jpg)
Graphite cookbook (Part 1)
➔ Install in a virtualenv (django, uwsgi, nginx)
➔ Dependencies recommended
  ◆ https://github.com/graphite-project/graphite-web/blob/master/requirements.txt
➔ libcairo2-dev package on Ubuntu 12.04 LTS
➔ install graphite’s 3 parts via pip
![Page 20: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/20.jpg)
Graphite cookbook (Part 2)
➔ graphite-web
  ◆ Django app, renders graphs
➔ whisper
  ◆ fixed-size database for storing time-series data
  ◆ like RRD
➔ carbon
  ◆ carbon-cache.py - stores data
  ◆ carbon-aggregator.py - buffers, then stores
  ◆ carbon-relay.py - for sharding/replication
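Whatever feeds carbon (StatsD included) ends up speaking its plaintext protocol: one line per datapoint in the form `metric.path value timestamp`. A minimal sketch of a sender follows. The metric path is hypothetical, and the host and port are assumptions (2003 is the usual carbon-cache line receiver default; yours may differ, especially with an aggregator or relay in front).

```python
import socket


def carbon_line(path, value, timestamp):
    """Format one datapoint for carbon's plaintext line protocol."""
    return "%s %s %d\n" % (path, value, timestamp)


def send_to_carbon(lines, host="127.0.0.1", port=2003):
    """Send already-formatted lines over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))
```

For example, `send_to_carbon([carbon_line("web01.requests", 42, int(time.time()))])` after `import time`. Being able to hand-craft a datapoint like this is handy when you need to tell whether a gap in a graph is a carbon problem or a sender problem.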
![Page 21: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/21.jpg)
when in doubt: tcpdump is your friend
http://blog.johngoulah.com/2012/10/looking-under-the-covers-of-statsd/
![Page 22: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/22.jpg)
carbon-aggravator (between 0.9.10 & 0.9.12)
# If set true, metric received will be forwarded to DESTINATIONS in addition to
# the output of the aggregation rules. If set false the carbon-aggregator will
# only ever send the output of aggregation.
FORWARD_ALL = True
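To make the difference concrete, here is a toy model of what that setting means for one flush interval. It is not carbon's code: the function, the sum-only rule, and the metric names are all made up for illustration, and it only mirrors the behavior described in the config comment.

```python
def aggregator_output(datapoints, rule_inputs, rule_output, forward_all):
    """Toy model of one aggregation interval.

    datapoints:  dict of metric name -> value received in this interval
    rule_inputs: set of metric names matched by a (hypothetical) sum rule
    rule_output: name of the aggregated metric the rule produces
    forward_all: mimics FORWARD_ALL; if True, raw metrics are passed on too
    """
    # With forwarding on, every received metric goes to the destinations as-is.
    sent = dict(datapoints) if forward_all else {}
    matched = [value for name, value in datapoints.items() if name in rule_inputs]
    if matched:
        sent[rule_output] = sum(matched)
    return sent
```

With `{"web01.hits": 3, "web02.hits": 4}` as input and a rule summing both into `all.hits`, forwarding off yields only `all.hits`, while forwarding on yields all three metrics. If the effective value differs from what you expect after an upgrade, per-host metrics can quietly vanish or reappear.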
![Page 23: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/23.jpg)
Carbonate
whisper-fill.py
backfill datapoints between whisper files
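The idea behind backfilling, sketched on plain dicts rather than real whisper files: copy datapoints from a source series into the gaps of a destination series, without overwriting anything the destination already has. This is a toy illustration only; the actual whisper-fill.py works on whisper archives and retention levels, and the function name here is made up.

```python
def backfill(dst, src):
    """Return a copy of dst with gaps filled from src.

    dst, src: dicts of timestamp -> value, where None (or a missing key)
    means no datapoint was recorded for that timestamp.
    """
    filled = dict(dst)
    for timestamp, value in src.items():
        # Only fill where the destination has nothing; never overwrite data.
        if filled.get(timestamp) is None and value is not None:
            filled[timestamp] = value
    return filled
```

This is the shape of the problem when migrating between Graphite boxes: the new box has recent data, the old box has history, and you want both without clobbering either.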
![Page 24: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/24.jpg)
2am: sudden drop-off
8am: look at graphs: ?!?!
10am: and we’re back.
![Page 25: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/25.jpg)
What’s next?
![Page 26: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/26.jpg)
the ideal monitoring solution...
❏ finds real problems
❏ actionable alerting
❏ usable by all
❏ …?
http://www.quickmeme.com/img/f5/f512ff9bee084263df5571d3c81388019dcb063173e1dbcfa2babac9274576b6.jpg
![Page 27: Monitoring at a SAAS Startup: Tradeoffs and Tools](https://reader034.vdocuments.mx/reader034/viewer/2022052310/554f944bb4c905d25b8b53c6/html5/thumbnails/27.jpg)
What we’re actually using now
StatsD
Application-level error analysis
Alarms for autoscaling
Timers & counters
Log & host-level
Hadoop & HBase visualization
MongoDB Graphs
Time-series data graphing
client-side plugins
External uptime checks
oncall rotation/alerting
Threshold-based alarms
Dashboard