munich, 9th august 2018 - promcon · prometheus 2.x reliable operational model powerful query...

Post on 14-Jun-2020

2 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Bartek Plotka Bwplotka Bplotka

Fabian Reinartz fabxc

Global, durable Prometheus monitoring

Munich, 9th August 2018

Prometheus 2.X

● Reliable operational model● Powerful query language● Scraping capabilities beyond the casual usage● Local metric storage

Prometheus

Cluster 1

Prometheus at Scale

Cluster 2

Prometheus

Cluster n

Cluster n+1

Prometheus...

Cluster 1

Problem: Global View

Cluster 2

Prometheus

Cluster n

Cluster n+1

Prometheus...

Grafana

Alertmanager

Cluster 1

Problem: Global View

Cluster 2

Prometheus

Cluster n

Cluster n+1

Prometheus...

Grafana

Alertmanager

sum(rate(go_memstats_alloc_bytes_total[1m])) by (env, cluster, job) ?

Cluster 1

Problem: Global View

Cluster 2

Prometheus

Cluster n

Cluster n+1

Prometheus...

Grafana

Alertmanager

sum(go_memstats_alloc_bytes_total::rate1m) by (env, cluster, job) ✓

Prometheus

/federate

Cluster 1

Problem: High Availability

Cluster 2

Prometheus

Cluster n

Cluster n+1

Prometheus...

Grafana

Alertmanager

Cluster 1

Problem: High Availability

PrometheusCluster 2

Prometheus

Cluster n

Cluster n+1

Prometheus...

Grafana

Alertmanager

Prometheus Prometheus

Cluster 1

Problem: High Availability

PrometheusCluster 2

Prometheus

Cluster n

Cluster n+1

Prometheus...

Grafana

Alertmanager

Prometheus Prometheus

“Which replica to use?”

Problem: Metric retention

Problem: Metric retention

SSD

Prometheus

PrometheusRemote write

Thanos

Goals

● Have a global view● Have a HA in place● Increase retention

Global View

See everything from a single place!

SSD

Prometheus

PrometheusTargets

SSD

Sidecar

Prometheus SidecarTargets

gRPC (Store API)

Store API

service Store {

rpc Series(SeriesRequest) returns (stream SeriesResponse);

rpc LabelNames(LabelNamesRequest) returns (LabelNamesResponse);

rpc LabelValues(LabelValuesRequest) returns (LabelValuesResponse);

}

message SeriesRequest {

int64 min_time = 1;

int64 max_time = 2;

repeated LabelMatcher matchers = 3;

}

Sidecar

Prometheus

remote read

Store API

SSD

Querier

Prometheus Sidecar

Querier

Store API

Targets

HTTP Query API

SSD

Global View

Prometheus Sidecar

Querier

Targets

SSD

SidecarTargets

Prometheus

Merge

Store API

SSD

Global View + Availability

Prometheus SidecarTargets

SSD

Sidecar

Targets

Prometheus

SSD

Sidecar Prometheus

“replica”:”1”

“replica”:”2”

QuerierMerge

Deduplicate

Store API

Thanos

Goals

● Have a global view ✓● Have a HA in place ✓

Prometheus Sidecar

SSD

Sidecar PrometheusSidecar Prometheus

Querier

Historical Metrics

What exactly happened X months ago?

TSDB Layout

Block 2 Block 4Block 3Block 1

T-10hT-16h T-4h T-2h T

TSDB Layout

Block 4Block 3Block 1

chunks chunks

chunks chunks

index

T-10hT-16h T-4h T-2h T

SSD

Data saving

Prometheus SidecarTargets

Object Storage

Blocks Blocks

Block

SSD

Data saving

Prometheus SidecarTargets

Object Storage

Blocks Blocks

Block

--storage.tsdb.max-block-duration=2h --storage.tsdb.retention=12h

Store Gateway

Object Storage

BlocksCache

Store

Querier

Store API

Thanos

Goals

● Have a global view ✓● Have a HA in place ✓● Increase retention ✓

Prometheus

Querier

Scrape EngineCompactor

Rule & Alert Engine

Prometheus

Thanos

Scrape EngineCompactor

Rule & Alert Engine

Thanos QuerierThanos Querier

Thanos Querier

Thanos

Compactor

Rule & Alert Engine

Thanos QuerierThanos Querier

Thanos Querier

SSD

Prometheus Sidecar

SSD

Prometheus Sidecar

SSD

Prometheus Sidecar

Thanos

Compactor

Thanos QuerierThanos Querier

Thanos Querier

SSD

Prometheus Sidecar

SSD

Prometheus Sidecar

SSD

Prometheus Sidecar

Thanos RulerThanos Ruler

Thanos

Thanos RulerThanos Ruler

Thanos QuerierThanos Querier

Thanos Querier

SSD

Prometheus Sidecar

SSD

Prometheus Sidecar

SSD

Prometheus SidecarGlobal Compactor

Thanos

Store GatewayStore

Gateway

Object StorageSSD

Prometheus Sidecar

SSD

Prometheus Sidecar

SSD

Prometheus Sidecar

Thanos RulerThanos Ruler

Global Compactor

Thanos QuerierThanos Querier

Thanos Querier

Deployment Models

Federation

QuerierQuerierQuerierStoreBucket

QuerierQuerierQuerier

StoreBucket

QuerierQuerierQuerierStoreBucket

Cluster A (master)

Cluster B

Cluster C

++

Federation (through Store API)++

++

Example Deployment

Cluster 1

Cluster 2

+

Cluster n

Cluster n+1

+...

+

Core Cluster

Grafana

Alertmanager

Bucket

Compactor

Querier Querier

Querier

Ruler Store

Statically configured

+

Example Global Deployment

++++

++++

++++

++++

++++

++++

Testing Staging

Production Querier Querier

Querier

Bonus: Downsampling

Downsampling

Raw: 16 bytes/sample

Compressed: 1.07 bytes/sample

Downsampling

BUT…

Downsampling

Decompressing one sample takes 10-40 nanoseconds

● Times 1000 series @ 30s scrape interval

● Times 1 year

Downsampling

Decompressing one sample takes 10-40 nanoseconds

● Times 1000 series @ 30s scrape interval

● Times 1 year

● Over 1 billion samples, i.e. 10-40s – for decoding alone

● Plus your actual computation over all those samples, e.g. rate()

Downsampling

BlockRAW

Block@ 5m

Block@ 1h

10x 12x

Downsampling

chunk

count sum min max counter

chunk...

Downsampling

count sum min max counter

count_over_time(requests_total[1h])

Downsampling

count sum min max counter

sum_over_time(requests_total[1h])

Downsampling

count sum min max counter

min(requests_total)

min_over_time(requests_total[1h])

Downsampling

count sum min max counter

max(requests_total)

max_over_time(requests_total[1h])

Downsampling

count sum min max counter

rate(requests_total[1h])

increase(requests_total[1h])

Downsampling

count sum min max counter

requests_total

avg(requests_total)

...

*

avg

Thanos

Goals

● Have a global view ✓● Have a HA in place ✓● Increase retention ✓

Any questions?

github.com/improbable-eng/thanos

Fabian Reinartz fabxc

Bartek Plotka bwplotka Bplotka

top related