Scaling Graphite at Yelp

Post on 21-Apr-2017

…And Metrics For All

Paul O’Connor
github.com/pauloconnor
2015-05-19

About Yelp

Founded: 2004
Monthly Active Users: ~142 Million
Non-US Monthly Users: ~31 Million
Reviews: ~77 Million
Local Businesses: 2.1 Million
Territories: Available in 31 countries

What are metrics?

Name             Value      Timestamp
server1.load.1m  28.826667  1431950640
server1.load.1m  29.188333  1431950700
server1.load.1m  29.231667  1431950760
server1.load.1m  29.083333  1431950820
server1.load.1m  29.710000  1431950880
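The name/value/timestamp triple above maps directly onto Carbon's plaintext protocol, where each datapoint is one line of text. The following is a minimal sketch, not the talk's own tooling: `format_metric` and `send_metric` are hypothetical helper names, and the host is an assumption (the port matches the 2003 mentioned later in the slides).

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one plaintext-protocol line: name, value, timestamp, newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %f %d\n" % (name, value, timestamp)


def send_metric(line, host="127.0.0.1", port=2003):
    """Send one line to a Carbon listener (host is an assumed example)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Rebuild the first row of the table above.
line = format_metric("server1.load.1m", 28.826667, 1431950640)
```

Calling `send_metric(line)` would deliver the datapoint, provided something is listening on that host and port.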

Graphite Components

• Carbon:
  • relay
  • cache
  • aggregator
• Whisper
• Web app

Carbon Relay

• Deals with 2 things
  • Replication
  • Sharding

Relay Methods

• Rules
    [replicate]
    pattern = ^services\.ads\..+
    servers = 10.1.2.3, 10.2.2.3
    continue = true

• Consistent Hashing
  • Defines a sharding strategy across multiple backends
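To make the sharding idea concrete, here is a minimal consistent hash ring. It is an illustrative sketch, not Carbon's actual implementation: the class name `HashRing`, the node names, the use of MD5, and the choice of 100 virtual points per node are all assumptions.

```python
import bisect
import hashlib


class HashRing:
    """Toy consistent hash ring mapping metric names to backend nodes."""

    def __init__(self, nodes, replicas=100):
        # Place several virtual points per node on the ring so that
        # metrics spread more evenly across the backends.
        self._ring = []
        for node in nodes:
            for i in range(replicas):
                self._ring.append((self._position("%s:%d" % (node, i)), node))
        self._ring.sort()
        self._positions = [pos for pos, _ in self._ring]

    @staticmethod
    def _position(key):
        # Use the first 32 bits of the MD5 digest as the ring position.
        return int(hashlib.md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, metric):
        # Walk clockwise to the first point at or after the metric's
        # position, wrapping around at the end of the ring.
        idx = bisect.bisect_left(self._positions, self._position(metric))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
node = ring.get_node("server1.load.1m")
```

The property that matters for scaling is that the same metric name always lands on the same backend, and that removing a backend only moves the metrics that lived on it.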

Carbon Cache

• Receives metrics and persists them to disk
• Writes based on storage schemas

Storage Schemas

• Details retention rates for storing metrics

[databases_10sec_1year]
pattern = ^servers\.db.*$
retentions = 10s:7d,1m:30d,5m:90d,30m:365d
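A retention string fixes how many datapoints each archive holds, and therefore how big every Whisper file for a matching metric will be. The sketch below works that out for the schema above. The helper names are hypothetical, only the `s`, `m`, `h` and `d` units are handled, and the byte counts (16-byte header, 12 bytes per archive entry, 12 bytes per point) reflect my understanding of the Whisper on-disk layout rather than anything stated in the talk.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def to_seconds(spec):
    """Convert a value such as '10s' or '7d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(retentions):
    """Return a list of (seconds_per_point, number_of_points) per archive."""
    archives = []
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


def whisper_file_size(archives):
    """Estimate file size in bytes (assumed layout, see the text above)."""
    points = sum(count for _, count in archives)
    return 16 + 12 * len(archives) + 12 * points


archives = parse_retentions("10s:7d,1m:30d,5m:90d,30m:365d")
size = whisper_file_size(archives)
```

Under these assumptions the schema stores 147,120 points per metric, or roughly 1.7 MB on disk, allocated up front for every metric that matches the pattern.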

Storage Aggregation

• Rules for aggregating data to lower-precision retentions

[all_min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
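As a rough illustration of what these two settings do when points are rolled up into a lower-precision archive: `xFilesFactor` is the minimum fraction of slots that must hold a value, and `aggregationMethod` is the function applied to the values that are present. The function name `roll_up` and the sample values are assumptions for the sketch.

```python
def roll_up(values, x_files_factor, method):
    """Aggregate one window of higher-precision points into a single point.

    None marks a slot with no data. The result is None when too few
    slots are filled to satisfy x_files_factor.
    """
    known = [v for v in values if v is not None]
    if not known or len(known) / len(values) < x_files_factor:
        return None
    return method(known)


# Six 10-second slots rolled up into one 1-minute point.
sparse = roll_up([None, 3.0, None, None, None, None], 0.1, min)
strict = roll_up([5.0, 2.0, None, None, None, None], 0.5, min)
```

With `xFilesFactor = 0.1`, a single known value out of six is enough to produce a rolled-up point, which suits sparse metrics; a factor of 0.5 would discard the same window.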

Carbon Aggregator

• Buffers metrics before forwarding to carbon cache
• Rolls up metrics based on rules

Aggregation Rules

• Not to be confused with storage aggregation
• Tells the carbon aggregator what to aggregate and how

output_template (frequency) = method input_pattern

<env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests

prod.applications.apache.www01.requests
prod.applications.apache.www02.requests
prod.applications.apache.www03.requests
prod.applications.apache.www04.requests
prod.applications.apache.www05.requests

prod.applications.apache.all.requests
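The rule and metric names above can be traced through with a small sketch. This is a simplified illustration rather than the carbon aggregator's code: it ignores the 60-second buffering window, the function names are hypothetical, and the per-host request counts are made-up sample values.

```python
import re
from collections import defaultdict

FIELD = re.compile(r"<(\w+)>")


def compile_input_pattern(pattern):
    """Turn '<env>.applications.<app>.*.requests' into a regex."""
    parts = []
    for part in pattern.split("."):
        field = FIELD.fullmatch(part)
        if field:
            # A <name> field captures exactly one path component.
            parts.append("(?P<%s>[^.]+)" % field.group(1))
        elif part == "*":
            parts.append("[^.]+")
        else:
            parts.append(re.escape(part))
    return re.compile(r"\.".join(parts))


def aggregate(datapoints, output_template, input_pattern, method=sum):
    """Group matching datapoints by rendered output name, then aggregate."""
    regex = compile_input_pattern(input_pattern)
    buckets = defaultdict(list)
    for name, value in datapoints:
        match = regex.fullmatch(name)
        if match:
            output = FIELD.sub(lambda m: match.group(m.group(1)), output_template)
            buckets[output].append(value)
    return {name: method(values) for name, values in buckets.items()}


datapoints = [
    ("prod.applications.apache.www01.requests", 1),
    ("prod.applications.apache.www02.requests", 2),
    ("prod.applications.apache.www03.requests", 3),
    ("prod.applications.apache.www04.requests", 4),
    ("prod.applications.apache.www05.requests", 5),
    ("prod.applications.apache.www01.errors", 9),  # does not match the rule
]
result = aggregate(
    datapoints,
    "<env>.applications.<app>.all.requests",
    "<env>.applications.<app>.*.requests",
)
```

The five per-host series collapse into one `prod.applications.apache.all.requests` series, so dashboards can read a single metric instead of expanding a wildcard.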

Whisper

• Fixed size database
• Allows for roll ups
• Allows for backfilling data

Web App

• Django based app for rendering graphs

Putting it all together

• Carbon cache listening on port 2003
• Write to disk
• Listen with web

Getting more complicated

• Carbon relay using consistent hashing to multiple caches
• Individual caches responsible for specific metrics

More Relays

• Use HAProxy to load balance between relays
• Run more relays to make use of more CPU

Even more relays

• Useful for sending metrics to other locations

Replicate the metrics

• Duplicate your metrics for backup and redundancy

More caches instead

• Consistent hash across multiple nodes

Where does the aggregator fit?

• Aggregator uses a lot of CPU. Put it on its own node

Scaling further

• Use nodes for particular functions:
  • Use forwarding relay nodes solely to forward
  • Have consistent hashing nodes
  • Have aggregation nodes

Getting your data back out

• Graphite Dashboard
• Third Party Dashboard
  • We use Grafana http://grafana.org/
  • Graphite-api https://github.com/brutasse/graphite-api
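Dashboards of this kind typically pull data through Graphite's HTTP render endpoint. As a small sketch, the helper below builds such a URL; `render_url` is a hypothetical name and `graphite.example.com` is a placeholder host, not an address from the talk.

```python
from urllib.parse import urlencode


def render_url(base_url, target, start="-1h", fmt="json"):
    """Build a render URL for one target over a relative time range."""
    query = urlencode({"target": target, "from": start, "format": fmt})
    return "%s/render?%s" % (base_url.rstrip("/"), query)


url = render_url("http://graphite.example.com", "server1.load.1m")
```

Fetching that URL from a running Graphite web app or graphite-api instance should return the last hour of datapoints for the metric as JSON.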

Tips

• Aggregate before ingestion
• Control the metrics that can be sent
• Metrics are a gas - they expand to fill all available room
• Use C implementation of carbon
• Use the latest webapp.

Optimize your dashboard queries

• services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99
• 2154 results
• 35 seconds to just find these files on disk
• Running functions against these results
• Timeout after a minute
• Dashboard automatically refreshing every 10 seconds
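The numbers above compound: if a dashboard fires a new request on every refresh without cancelling the previous one (an assumption, since the slides do not say how the dashboard behaves), slow queries pile up. A back-of-the-envelope sketch, with a hypothetical function name:

```python
import math


def requests_in_flight(refresh_seconds, query_seconds, timeout_seconds):
    """Estimate concurrent requests for one panel on one open dashboard."""
    # A request occupies the backend until it finishes or times out.
    busy = min(query_seconds, timeout_seconds)
    return math.ceil(busy / refresh_seconds)


# Queries that run into the one-minute timeout, refreshed every 10 seconds.
in_flight = requests_in_flight(10, 60, 60)
```

Under that assumption, a single panel on a single open dashboard keeps about six of these expensive queries running at once, none of which ever returns a result.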

What’s the Future?

• InfluxDB
• Cassandra
• Third party

We’re hiring!
http://www.yelp.com/careers
Hiring SREs in Dublin, London, New York, San Francisco
