

Page 1: Munich, 9th August 2018 - PromCon · Prometheus 2.X Reliable operational model Powerful query language Scraping capabilities beyond the casual usage Local metric storage

Bartek Plotka Bwplotka Bplotka

Fabian Reinartz fabxc

Global, durable Prometheus monitoring

Munich, 9th August 2018

Page 2:

Prometheus 2.X

● Reliable operational model
● Powerful query language
● Scraping capabilities beyond the casual usage
● Local metric storage

Page 3:

Prometheus at Scale

[Diagram: Prometheus servers in Cluster 1, Cluster 2, ..., Cluster n, Cluster n+1]

Page 4:

Problem: Global View

[Diagram: the per-cluster Prometheus servers from Page 3, plus Grafana and Alertmanager]

Page 5:

Problem: Global View

[Diagram: as on Page 4]

sum(rate(go_memstats_alloc_bytes_total[1m])) by (env, cluster, job) ?

Page 6:

Problem: Global View

[Diagram: as on Page 4, plus an additional Prometheus using /federate]

sum(go_memstats_alloc_bytes_total::rate1m) by (env, cluster, job) ✓

Page 7:

Problem: High Availability

[Diagram: Prometheus servers in Cluster 1, Cluster 2, ..., Cluster n, Cluster n+1; Grafana; Alertmanager]

Page 8:

Problem: High Availability

[Diagram: as on Page 7, with additional Prometheus instances]

Page 9:

Problem: High Availability

[Diagram: as on Page 8]

“Which replica to use?”

Page 10:

Problem: Metric retention

Page 11:

Problem: Metric retention

[Diagram: Prometheus with local SSD; remote write]

Page 12:

Thanos

Goals

● Have a global view
● Have a HA in place
● Increase retention

Page 13:

Global View

See everything from a single place!

Page 14:

[Diagram: Prometheus with local SSD, scraping Targets]

Page 15:

[Diagram: Sidecar next to Prometheus (SSD, Targets), exposing gRPC (Store API)]

Page 16:

Store API

    service Store {
      rpc Series(SeriesRequest) returns (stream SeriesResponse);
      rpc LabelNames(LabelNamesRequest) returns (LabelNamesResponse);
      rpc LabelValues(LabelValuesRequest) returns (LabelValuesResponse);
    }

    message SeriesRequest {
      int64 min_time = 1;
      int64 max_time = 2;
      repeated LabelMatcher matchers = 3;
    }

[Diagram: Sidecar exposes the Store API and reads from Prometheus via remote read]
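To make the shape of the Store API concrete, here is a minimal in-memory sketch in Python. It is not the Thanos implementation: the `InMemoryStore` class is hypothetical, matchers are assumed to be exact-equality only, and a generator stands in for the gRPC response stream.

```python
from dataclasses import dataclass


@dataclass
class Series:
    labels: dict    # label name -> label value
    samples: list   # (timestamp_ms, value) pairs, sorted by timestamp


class InMemoryStore:
    """Hypothetical stand-in for a Store API server (illustration only)."""

    def __init__(self, series):
        self._series = list(series)

    def series(self, min_time, max_time, matchers):
        # Assumption: matchers is a dict of label name -> exact value.
        for s in self._series:
            if any(s.labels.get(k) != v for k, v in matchers.items()):
                continue
            window = [(t, v) for t, v in s.samples if min_time <= t <= max_time]
            if window:
                yield Series(s.labels, window)

    def label_names(self):
        return sorted({name for s in self._series for name in s.labels})

    def label_values(self, name):
        return sorted({s.labels[name] for s in self._series if name in s.labels})


store = InMemoryStore([
    Series({"__name__": "up", "job": "api"}, [(1000, 1.0), (2000, 1.0), (3000, 0.0)]),
    Series({"__name__": "up", "job": "db"}, [(1000, 1.0)]),
])
result = list(store.series(1500, 3000, {"job": "api"}))
```

Anything that can answer these three calls for a time range and a set of matchers can sit behind a Querier, which is why the same interface reappears for the Sidecar and the Store Gateway.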

Page 17:

Querier

[Diagram: Querier exposes the HTTP Query API and reaches the Sidecar via the Store API; Prometheus with SSD and Targets]

Page 18:

Global View

[Diagram: Querier merges results over the Store API from two Prometheus + Sidecar pairs, each with its own SSD and Targets]
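The merge step can be sketched as a fan-out over all known stores. This is a simplified Python illustration, not the Querier's actual code: each store is assumed to be a plain callable, and `make_store` is a hypothetical helper that serves fixed data.

```python
def fan_out(stores, min_time, max_time):
    """Query every store and merge samples of series with identical labels."""
    merged = {}
    for fetch in stores:
        # Assumption: each store is a callable returning (labels, samples) pairs.
        for labels, samples in fetch(min_time, max_time):
            key = tuple(sorted(labels.items()))
            merged.setdefault(key, set()).update(samples)
    return {key: sorted(samples) for key, samples in merged.items()}


def make_store(data):
    """Hypothetical store holding a fixed list of (labels, samples) pairs."""
    def fetch(min_time, max_time):
        for labels, samples in data:
            window = [(t, v) for t, v in samples if min_time <= t <= max_time]
            if window:
                yield labels, window
    return fetch


cluster_1 = make_store([({"job": "api", "cluster": "1"}, [(10, 1.0), (20, 2.0)])])
cluster_2 = make_store([({"job": "api", "cluster": "2"}, [(10, 5.0), (20, 6.0)])])

view = fan_out([cluster_1, cluster_2], 0, 15)
```

Because the clusters carry different label values, their series stay separate in the merged view and can then be aggregated by a single query.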

Page 19:

Global View + Availability

[Diagram: three Prometheus + Sidecar pairs with SSDs and Targets; two carry the labels "replica":"1" and "replica":"2"; the Querier merges and deduplicates over the Store API]
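Deduplication on a replica label could look roughly like the following Python sketch. It is a deliberate simplification under stated assumptions: both replicas are assumed to produce samples at identical timestamps, and the first value seen for a timestamp wins. The real Querier has to cope with replicas scraping at slightly different times, which this sketch does not attempt.

```python
def deduplicate(series_list, replica_label="replica"):
    """Collapse series that differ only in the replica label."""
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        by_time = merged.setdefault(key, {})
        for t, v in samples:
            # Simplification: the first replica seen wins for a given timestamp.
            by_time.setdefault(t, v)
    return {key: sorted(by_time.items()) for key, by_time in merged.items()}


# Replica 1 missed the scrape at t=20; replica 2 fills the gap.
replica_1 = ({"job": "api", "replica": "1"}, [(10, 1.0), (30, 3.0)])
replica_2 = ({"job": "api", "replica": "2"}, [(10, 1.0), (20, 2.0), (30, 3.0)])

deduped = deduplicate([replica_1, replica_2])
```

The result is one gap-free series without the replica label, which answers the earlier question of which replica to use: the caller no longer has to pick.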

Page 20:

Thanos

Goals

● Have a global view ✓
● Have a HA in place ✓

[Diagram: Querier on top of three Prometheus + Sidecar pairs]

Page 21:

Historical Metrics

What exactly happened X months ago?

Page 22:

TSDB Layout

[Diagram: timeline with marks at T-16h, T-10h, T-4h, T-2h and T, divided into Block 1, Block 2, Block 3 and Block 4]

Page 23:

TSDB Layout

[Diagram: the same timeline; Block 2 expanded to show its chunks files and its index]
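Since every block covers a fixed time range, a query only needs the blocks that overlap its range. The Python sketch below illustrates that selection. The block boundaries are an assumption read off the timeline marks (Block 1 from T-16h to T-10h, and so on), and the value of `T` is arbitrary.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000


@dataclass
class Block:
    name: str
    min_time: int   # inclusive, milliseconds
    max_time: int   # exclusive, milliseconds


def blocks_for_range(blocks, min_time, max_time):
    """Return the names of blocks overlapping the query range [min_time, max_time]."""
    return [b.name for b in blocks if b.min_time <= max_time and min_time < b.max_time]


T = 100 * HOUR_MS  # hypothetical "now"
blocks = [
    Block("Block 1", T - 16 * HOUR_MS, T - 10 * HOUR_MS),
    Block("Block 2", T - 10 * HOUR_MS, T - 4 * HOUR_MS),
    Block("Block 3", T - 4 * HOUR_MS, T - 2 * HOUR_MS),
    Block("Block 4", T - 2 * HOUR_MS, T),
]

last_three_hours = blocks_for_range(blocks, T - 3 * HOUR_MS, T)
```

A query over the last three hours touches only the two most recent blocks; older blocks are never opened.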

Page 24:

Data saving

[Diagram: the Sidecar uploads Blocks from the Prometheus SSD to Object Storage]

Page 25:

Data saving

[Diagram: as on Page 24]

--storage.tsdb.max-block-duration=2h --storage.tsdb.retention=12h

Page 26:

Store Gateway

[Diagram: Store, with a Cache, reads Blocks from Object Storage and serves the Querier via the Store API]

Page 27:

Thanos

Goals

● Have a global view ✓
● Have a HA in place ✓
● Increase retention ✓

Page 28:

Prometheus

[Diagram: Prometheus components: Querier, Scrape Engine, Compactor, Rule & Alert Engine]

Page 29:

Thanos

[Diagram: Scrape Engine, Compactor, Rule & Alert Engine; the Querier replaced by several Thanos Queriers]

Page 30:

Thanos

[Diagram: Compactor, Rule & Alert Engine, Thanos Queriers; the Scrape Engine replaced by three Prometheus + Sidecar pairs with SSDs]

Page 31:

Thanos

[Diagram: Compactor, Thanos Queriers, three Prometheus + Sidecar pairs; the Rule & Alert Engine replaced by Thanos Rulers]

Page 32:

Thanos

[Diagram: Thanos Rulers, Thanos Queriers, three Prometheus + Sidecar pairs; the Compactor replaced by a Global Compactor]

Page 33:

Thanos

[Diagram: Thanos Queriers, Thanos Rulers, Global Compactor, three Prometheus + Sidecar pairs with SSDs, Store Gateways, Object Storage]

Page 34:

Deployment Models

Page 35:

Federation

[Diagram: Cluster A (master), Cluster B and Cluster C, each with Queriers, a Store and a Bucket; federation between clusters goes through the Store API]

Page 36:

Example Deployment

[Diagram: Cluster 1, Cluster 2, ..., Cluster n, Cluster n+1 connected to a Core Cluster containing Grafana, Alertmanager, Bucket, Compactor, Queriers, Ruler and Store; statically configured]

Page 37:

Example Global Deployment

[Diagram: many clusters grouped into Testing, Staging and Production; Queriers]

Page 38:

Bonus: Downsampling

Page 39:

Downsampling

Raw: 16 bytes/sample

Compressed: 1.07 bytes/sample

Page 40:

Downsampling

BUT…

Page 41:

Downsampling

Decompressing one sample takes 10-40 nanoseconds

● Times 1000 series @ 30s scrape interval

● Times 1 year

Page 42:

Downsampling

Decompressing one sample takes 10-40 nanoseconds

● Times 1000 series @ 30s scrape interval

● Times 1 year

● Over 1 billion samples, i.e. 10-40s – for decoding alone

● Plus your actual computation over all those samples, e.g. rate()
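The estimate above can be reproduced with a few lines of Python. The function names are illustrative only, and a year is taken as 365 days.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def samples_in_range(series, scrape_interval_s, range_s):
    """Number of samples stored for `series` series over `range_s` seconds."""
    return series * (range_s // scrape_interval_s)


def decode_seconds(samples, ns_per_sample):
    """Time spent on decoding alone, in seconds."""
    return samples * ns_per_sample / 1e9


samples = samples_in_range(series=1000, scrape_interval_s=30, range_s=SECONDS_PER_YEAR)
fastest = decode_seconds(samples, 10)
slowest = decode_seconds(samples, 40)
```

With these inputs `samples` comes to 1,051,200,000, and decoding takes roughly 10.5 to 42 seconds before any query function has run.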

Page 43:

Downsampling

[Diagram: Block RAW, Block @ 5m, Block @ 1h; the steps between them are labelled 10x and 12x]
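The 10x and 12x factors are consistent with the resolutions involved if raw data arrives every 30 seconds. That interval is an assumption carried over from the decoding estimate, not something this slide states. A short Python sketch of the arithmetic, counting data points per series rather than bytes:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

# Assumption: raw data is scraped every 30s, as in the decoding estimate.
RESOLUTIONS_S = {"raw": 30, "5m": 5 * 60, "1h": 60 * 60}


def reduction(fine, coarse):
    """How many fine-resolution points fall into one coarse-resolution point."""
    return RESOLUTIONS_S[coarse] // RESOLUTIONS_S[fine]


def points_per_series_per_year(resolution):
    return SECONDS_PER_YEAR // RESOLUTIONS_S[resolution]


raw_to_5m = reduction("raw", "5m")
five_m_to_1h = reduction("5m", "1h")
points = {name: points_per_series_per_year(name) for name in RESOLUTIONS_S}
```

Under that assumption a one-year query over a single series reads about a million raw points, but only 8,760 points at 1h resolution.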

Page 44:

Downsampling

[Diagram: a chunk split into count, sum, min, max and counter; followed by further chunks]
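One way to picture the five aggregates is to compute them per fixed window over raw samples. The Python sketch below does that under simplifying assumptions: samples are sorted by time, and the `counter` aggregate just keeps the last value of the window without handling counter resets. It illustrates the idea and is not the Thanos chunk format.

```python
def downsample(samples, window_s):
    """Group (timestamp_s, value) samples into fixed windows of aggregates."""
    windows = {}
    for t, v in samples:
        start = t - t % window_s
        agg = windows.get(start)
        if agg is None:
            windows[start] = {"count": 1, "sum": v, "min": v, "max": v, "counter": v}
        else:
            agg["count"] += 1
            agg["sum"] += v
            agg["min"] = min(agg["min"], v)
            agg["max"] = max(agg["max"], v)
            # Simplification: keep the last value; counter resets are not handled.
            agg["counter"] = v
    return windows


raw = [(0, 1.0), (30, 3.0), (60, 2.0), (300, 5.0)]
five_minute = downsample(raw, window_s=300)
```

Each window keeps enough information to answer the common range functions without the raw samples.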

Page 45:

Downsampling

count sum min max counter

count_over_time(requests_total[1h])

Page 46:

Downsampling

count sum min max counter

sum_over_time(requests_total[1h])

Page 47:

Downsampling

count sum min max counter

min(requests_total)

min_over_time(requests_total[1h])

Page 48:

Downsampling

count sum min max counter

max(requests_total)

max_over_time(requests_total[1h])

Page 49:

Downsampling

count sum min max counter

rate(requests_total[1h])

increase(requests_total[1h])

Page 50:

Downsampling

count sum min max counter

requests_total

avg(requests_total)

...

*

avg
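The preceding slides pair each query function with the aggregate it reads. As a hedged Python sketch, assuming each downsampled window is a dict with `count`, `sum`, `min` and `max` keys (an illustrative layout, not the real storage format), the range functions reduce to simple combinations, with the average derived as sum divided by count. `rate()` and `increase()` would read the `counter` aggregate instead; they are left out because correct counter-reset handling goes beyond this sketch.

```python
def count_over_time(windows):
    return sum(w["count"] for w in windows)


def sum_over_time(windows):
    return sum(w["sum"] for w in windows)


def min_over_time(windows):
    return min(w["min"] for w in windows)


def max_over_time(windows):
    return max(w["max"] for w in windows)


def average(windows):
    # Derived from two stored aggregates rather than stored itself.
    return sum_over_time(windows) / count_over_time(windows)


# Two hypothetical downsampled windows covering the queried range.
hour = [
    {"count": 10, "sum": 50.0, "min": 2.0, "max": 9.0},
    {"count": 10, "sum": 70.0, "min": 4.0, "max": 12.0},
]

summary = {
    "count": count_over_time(hour),
    "sum": sum_over_time(hour),
    "min": min_over_time(hour),
    "max": max_over_time(hour),
    "avg": average(hour),
}
```

Averaging the per-window averages would give the wrong answer when windows hold different numbers of samples, which is why the sketch divides total sum by total count.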

Page 51:

Thanos

Goals

● Have a global view ✓
● Have a HA in place ✓
● Increase retention ✓

Page 52:

Any questions?

github.com/improbable-eng/thanos

Fabian Reinartz fabxc

Bartek Plotka bwplotka Bplotka