PromCon 2016 lightning talk

Why we love Prometheus (and you should too): A few stories to prove a point

Upload: gil-fliker

Post on 12-Apr-2017

TRANSCRIPT

Why we love Prometheus (and you should too)

A few stories to prove a point

The Visibility team @ Outbrain: Gil Fliker, Shalom Cohen, Mina Naguib

We listen here

[email protected]

So Outbrain

Outbrain is the world’s largest content discovery platform, bringing personalized, relevant online, mobile and video content to audiences.

Outbrain serves:

557 MILLION monthly unique visitors.

250 BILLION personalized content recommendations every month.

63% of the US online population uses Outbrain.

A few notes on how Outbrain does alerting

Everything gets an owner - not as easy as you might think.

Everything is available via self-serve tools - instrumenting code, alerting, graphing.

Each team has its own on-call rotation and is responsible for monitoring its services.

Automate everything: new servers, services…

Where we started - Yes, Nagios

Collectd is used for 3rd party services and OS level metrics.

Outbrain code sends metrics directly to Graphite.

4 data centers, thousands of servers, god knows how many checks.

4 Nagios agents, one master.

Using home-grown automation, custom check Graphite script and Grafana.

2x Graphite clusters being hit with 18M metrics/minute each. (This is a whole other story.)
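As a rough illustration of the push model described above, here is a minimal Go sketch of one sample in Graphite's plaintext protocol (`<path> <value> <timestamp>`), plus the per-second rate implied by 18M metrics/minute. The metric name and helper functions are hypothetical and are not taken from Outbrain's code.

```go
package main

import (
	"fmt"
	"strconv"
)

// graphiteLine formats one sample in Graphite's plaintext protocol:
// "<metric.path> <value> <unix-timestamp>\n".
func graphiteLine(path string, value float64, unixSeconds int64) string {
	return path + " " + strconv.FormatFloat(value, 'f', -1, 64) + " " +
		strconv.FormatInt(unixSeconds, 10) + "\n"
}

// perSecond converts a per-minute rate into a per-second rate.
func perSecond(perMinute int64) int64 {
	return perMinute / 60
}

func main() {
	// Hypothetical metric name, for illustration only.
	fmt.Print(graphiteLine("services.example.requests", 42, 1471000000))

	// 18M metrics/minute works out to 300,000 metrics/second per cluster.
	fmt.Println(perSecond(18000000), "metrics/second per cluster")
}
```

In a real sender these lines would be written to a TCP connection to the Graphite endpoint; that part is omitted here.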

A Prometheus definition

Prometheus is:

A monitoring and alerting system for distributed systems and infrastructure.

Prometheus is not:

A long-term archival system, a business intelligence reporting system, a data-mining backend.

Benjamin Staffin @ Fitbit Site Operations

Prometheus features we like

Operational simplicity, no external dependency, single binary (Go).

Powerful query language.

Many native service discovery integrations - we use Consul.

And

Keeps us warm, because you know “The Night is Dark and full of Terrors.”
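For contrast with pushing to Graphite: Prometheus pulls metrics by scraping an HTTP endpoint on each target. The sketch below uses only the Go standard library (not the official Prometheus client library, which is what one would normally use) to show what a scrape target returns in the text exposition format. The metric name `demo_requests_total` and the helper names are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// requests is a hypothetical counter that the application would increment.
var requests uint64

// renderCounter renders a single counter in the Prometheus text
// exposition format: HELP line, TYPE line, then the sample itself.
func renderCounter(name, help string, value uint64) string {
	return fmt.Sprintf("# HELP %s %s\n# TYPE %s counter\n%s %d\n",
		name, help, name, name, value)
}

// metricsHandler is what a Prometheus server would scrape, typically at /metrics.
func metricsHandler(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprint(w, renderCounter("demo_requests_total",
		"Requests handled (illustrative).", atomic.LoadUint64(&requests)))
}

func main() {
	atomic.AddUint64(&requests, 3) // pretend three requests were served

	// Exercise the handler in-process instead of starting a real server.
	rec := httptest.NewRecorder()
	req := httptest.NewRequest("GET", "/metrics", nil)
	metricsHandler(rec, req)
	fmt.Print(rec.Body.String())
}
```

In a real service the handler would be registered with `http.HandleFunc("/metrics", metricsHandler)`, and targets would be found through service discovery (Consul, in the setup described here) rather than listed by hand.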

Three stories as an example

The story of Cumulus switches monitoring


The story of Nagios monitoring

The story of Revee, living @ AWS

Famous last words before you go

- Don't start without automation.

- Make sure you have enough bandwidth for education, it is not an easy switch.

- Run things in parallel, old and new.

- Join relevant forums; the majority of questions have already been asked (besides your version).

Want to join the team - [email protected]

Want to ask a question or just chat? Catch me after this is done.