Sarah Wells - Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

Page 1: Sarah Wells - Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

MILAN 20/21.11.2015

Alert overload: How to adopt a microservices architecture without being overwhelmed with noise

Sarah Wells - Financial Times

@sarahjwells

Microservices make it worse

microservices (n, pl): an efficient device for transforming business problems into distributed transaction problems

@drsnooks

You have a lot more systems

45 microservices
3 environments
2 instances for each service
20 checks per service
running every 5 minutes

> 1,500,000 system checks per day
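
The figure follows from multiplying out the numbers on the preceding slides. A quick sketch of the arithmetic (the class and method names are illustrative, not from the talk):

```java
final class CheckVolume {
    /** Checks executed per day for a fleet that is polled on a fixed interval. */
    static long checksPerDay(int services, int environments, int instancesPerService,
                             int checksPerService, int intervalMinutes) {
        // 45 * 3 * 2 * 20 = 5,400 checks on every polling run
        long checksPerRun = (long) services * environments * instancesPerService * checksPerService;
        // 24 * 60 / 5 = 288 polling runs per day
        long runsPerDay = (24 * 60) / intervalMinutes;
        return checksPerRun * runsPerDay;
    }

    public static void main(String[] args) {
        System.out.println(checksPerDay(45, 3, 2, 20, 5)); // prints 1555200
    }
}
```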

Over 19,000 system monitoring alerts in 50 days

An average of 380 per day

Functional monitoring is also an issue

12,745 response time/error alerts in 50 days

An average of 255 per day

Why so many?

http://devopsreactions.tumblr.com/post/122408751191/alerts-when-an-outage-starts

How can you make it better?

Quick starts: attack your problem

See our EngineRoom blog for more: http://bit.ly/1PP7uQQ

1. Think about monitoring from the start

It's the business functionality you care about

We care about whether published content made it to us

When people call our APIs, we care about speed

… we also care about errors

But it's the end-to-end that matters

https://www.flickr.com/photos/robef/16537786315/

You only want an alert where you need to take action

If you just want information, create a dashboard or report

Make sure you can't miss an alert

Make the alert great

http://www.thestickerfactory.co.uk/

Build your system with support in mind

Transaction ids tie all microservices together
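
One way to make this work is for each service to reuse the transaction id it receives, mint a new one only when none arrives, and put the id in every log line. A minimal sketch under stated assumptions: the `X-Request-Id` header name, the `transaction_id=` log prefix, and the class below are hypothetical, and the `tid_` format simply mirrors the id that appears in the publish-failure alert later in the deck.

```java
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class TransactionIds {
    // Assumed header name; use whatever your services agree on.
    // Note: this sketch does a case-sensitive lookup on the header map.
    static final String HEADER = "X-Request-Id";

    /** Reuse the caller's transaction id if present, otherwise create one. */
    static String resolve(Map<String, String> requestHeaders) {
        String incoming = requestHeaders.get(HEADER);
        if (incoming != null && !incoming.trim().isEmpty()) {
            return incoming;
        }
        return generate();
    }

    /** Generates "tid_" followed by ten random lowercase letters. */
    static String generate() {
        StringBuilder sb = new StringBuilder("tid_");
        for (int i = 0; i < 10; i++) {
            sb.append((char) ('a' + ThreadLocalRandom.current().nextInt(26)));
        }
        return sb.toString();
    }

    /** Prefixes a log message so log aggregation can group lines by transaction. */
    static String logLine(String transactionId, String message) {
        return "transaction_id=" + transactionId + " " + message;
    }
}
```

Each service would pass the resolved id on in the same header when it calls the next service, so a single search in the log aggregator shows the whole journey of one request.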

Healthchecks tell you whether a service is OK

GET http://{service}/__health

returns 200 if the service can run the healthcheck

each check will return "ok": true or "ok": false
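
A sketch of how a service might assemble such a response. The slide only specifies the `/__health` path, the 200 status, and a per-check `"ok"` flag; the `checks` and `name` field names and the class below are assumptions for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.stream.Collectors;

final class Healthcheck {
    // Insertion-ordered so the response lists checks in the order they were added.
    private final Map<String, BooleanSupplier> checks = new LinkedHashMap<>();

    Healthcheck add(String name, BooleanSupplier check) {
        checks.put(name, check);
        return this;
    }

    /**
     * Body for GET /__health. The endpoint answers 200 whenever this method
     * can run; individual failures show up as "ok": false in the body.
     * Check names are not JSON-escaped in this sketch.
     */
    String toJson() {
        return checks.entrySet().stream()
            .map(e -> "{\"name\":\"" + e.getKey() + "\",\"ok\":" + run(e.getValue()) + "}")
            .collect(Collectors.joining(",", "{\"checks\":[", "]}"));
    }

    // A check that throws is reported as failed rather than breaking the endpoint.
    private static boolean run(BooleanSupplier check) {
        try {
            return check.getAsBoolean();
        } catch (RuntimeException e) {
            return false;
        }
    }
}
```

Separating "the healthcheck ran" (HTTP 200) from "every check passed" (the `ok` flags) lets a monitoring tool distinguish a dead service from a live service with a failing dependency.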

Synthetic requests tell you about problems early

https://www.flickr.com/photos/jted/5448635109

2. Use the right tools for the job

There are basic tools you need

FT Platform: An internal PaaS

Service monitoring (e.g. Nagios)

Log aggregation (e.g. Splunk)

Graphing (e.g. Graphite/Grafana)

metrics:
  reporters:
    - type: graphite
      frequency: 1 minute
      durationUnit: milliseconds
      rateUnit: seconds
      host: <%= @graphite.host %>
      port: 2003
      prefix: content.<%= @config_env %>.api-policy-component.<%= scope.lookupvar('::hostname') %>
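
To make the effect of the `prefix` setting concrete: port 2003 is Graphite's plaintext listener, which accepts one line per data point in the form `<path> <value> <epoch seconds>`. The sketch below shows what a reported data point might look like once the template variables are filled in. The environment, hostname, and metric name used in the usage note are made-up values, and the class is illustrative rather than part of any metrics library.

```java
final class GraphiteLine {
    /** Mirrors the prefix template: content.<env>.api-policy-component.<hostname>. */
    static String prefix(String env, String hostname) {
        return "content." + env + ".api-policy-component." + hostname;
    }

    /** Formats one data point in Graphite's plaintext format. */
    static String format(String prefix, String metricName, double value, long epochSeconds) {
        return prefix + "." + metricName + " " + value + " " + epochSeconds + "\n";
    }
}
```

For example, `GraphiteLine.format(GraphiteLine.prefix("prod", "host01"), "requests.p95", 12.5, 1444635834L)` yields `content.prod.api-policy-component.host01.requests.p95 12.5 1444635834` followed by a newline. Putting the environment and hostname in the path is what lets a Grafana dashboard compare instances or aggregate across them with wildcards.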

Real time error analysis (e.g. Sentry)

Build other tools to support you

SAWS

Built by Silvano Dossan

See our Engine room blog: http://bit.ly/1GATHLy

"I imagine most people do exactly what I do - create a google filter to send all Nagios emails straight to the bin"

"Our screens have a viewing angle of about 10 degrees"

"It never seems to show the page I want"

Code at: https://github.com/muce/SAWS

Dashing

Nagios chart

Built by Simon Gibbs

@simonjgibbs

Use the right communication channel

It's not email

Slack integration

Radiators everywhere

3. Cultivate your alerts

Review the alerts you get

If it isn't helpful, make sure you don't get sent it again

See if you can improve it

www.workcompass.com/

Splunk Alert: PROD - MethodeAPIResponseTime5MAlert

Business Impact
The methode api server is slow responding to requests. This might result in articles not getting published to the new content platform or publishing requests timing out.

...

Technical Impact
The server is experiencing service degradation because of network latency, high publishing load, high bandwidth utilization, excessive memory or cpu usage on the VM. This might result in failure to publish articles to the new content platform.

Splunk Alert: PROD Content Platform Ingester Methode Publish Failures Alert

There has been one or more publish failures to the Universal Publishing Platform. The UUIDs are listed below.

Please see the run book for more information.

_time                     transaction_id  uuid
Mon Oct 12 07:43:54 2015  tid_pbueyqnsqe  a56a2698-6e90-11e5-8608-a0853fb4e1fe

When you didn't get an alert

What would have told you about this?

Setting up an alert is part of fixing the problem

✔ code

✔ test

alerts

System boundaries are more difficult

Severin.stalder [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons

Make sure you would know if an alert stopped working

Add a unit test

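@Test
public void shouldIncludeTriggerWordsForPublishFailureAlertInSplunk() {
    // Assert that the publish-failure log line still contains the words the Splunk alert searches for.
}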
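
One way to fill in such a test is to keep both the log message and the alert's search terms in code, so the build fails if they drift apart. A framework-free sketch; the trigger words, message format, and class name are hypothetical, not the FT's actual Splunk search.

```java
import java.util.Arrays;
import java.util.List;

final class PublishFailureLog {
    // Hypothetical words the Splunk alert's saved search is assumed to match on.
    static final List<String> TRIGGER_WORDS = Arrays.asList("publish", "failed");

    /** Builds the log line written when publishing a piece of content fails. */
    static String message(String transactionId, String uuid) {
        return "transaction_id=" + transactionId + " uuid=" + uuid + " publish failed";
    }

    /** True if the message still contains every word the alert depends on. */
    static boolean includesTriggerWords(String message) {
        return TRIGGER_WORDS.stream().allMatch(message::contains);
    }
}
```

A unit test would then assert `includesTriggerWords(message(...))`, so that a well-meaning rewording of the log message cannot silently disable the alert.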

Deliberately break things

Chaos snail

The thing that sends you alerts needs to be up and running

https://www.flickr.com/photos/davidmasters/2564786205/

What's happened to our alerts?

We turned off ALL emails from system monitoring

Our two most important alerts come in via our team Slack channel

We have dashboards for our read APIs in Grafana

To summarise...

Build microservices

About technology at the FT:

Look us up on Stack Overflow: http://bit.ly/1H3eXVe

Read our blog: http://engineroom.ft.com/

The FT on GitHub

https://github.com/Financial-Times/

https://github.com/ftlabs

Thank you!

Questions?