How we sleep well at night using Hystrix at Finn.no

Hystrix - What did we learn? JavaZone September 2015 Hystrix cristata Audun Fauchald Strand & Henning Spjelkavik

Upload: henning-spjelkavik

Post on 15-Jan-2017


TRANSCRIPT

Page 1: How we sleep well at night using Hystrix at Finn.no

Hystrix- What did we learn?

JavaZone September 2015

Hystrix cristata

Audun Fauchald Strand & Henning Spjelkavik

Page 2: How we sleep well at night using Hystrix at Finn.no
Page 3: How we sleep well at night using Hystrix at Finn.no
Page 4: How we sleep well at night using Hystrix at Finn.no

Example

public int lookup(MapPoint p) {
    return altitude(p);
}

Page 5: How we sleep well at night using Hystrix at Finn.no

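Example

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;

public int lookup(MapPoint p) {
    return new LookupCommand(p).execute();
}

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    // The risky remote call, executed in isolation by Hystrix.
    @Override
    protected Integer run() throws Exception {
        return altitude(p);
    }

    // Returned when run() fails, times out or is short-circuited.
    @Override
    protected Integer getFallback() {
        return -1;
    }
}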

Page 6: How we sleep well at night using Hystrix at Finn.no

Audun Fauchald Strand, @audunstrand

Henning Spjelkavik, @spjelkavik

Page 7: How we sleep well at night using Hystrix at Finn.no

Agenda
- Why?
- Tolerance for failure - How?
- How to create a Hystrix Command
- Monitoring and Dashboard
- Examples from finn
- What did we learn

Page 9: How we sleep well at night using Hystrix at Finn.no

Service A calls Service B

Page 10: How we sleep well at night using Hystrix at Finn.no

Map calls User over the network
What can possibly go wrong?

Page 11: How we sleep well at night using Hystrix at Finn.no

Map calls User
What can possibly go wrong?
1. Connection refused
2. Slow answer
3. Very slow answer (= never)
4. The result causes an exception in the client library

Page 12: How we sleep well at night using Hystrix at Finn.no

Map calls User
What can possibly go wrong?
1. Connection refused => < 2 ms
2. Slow answer => 5 s
3. Very slow answer => timeout
4. The result causes an exception in the client library => depends

Page 13: How we sleep well at night using Hystrix at Finn.no

Map calls User
What can possibly go wrong?
1. Connection refused => < 2 ms
2. Slow answer => 5 s
3. Very slow answer => timeout
4. The result causes an exception in the client library => depends

Fails quickly

Page 14: How we sleep well at night using Hystrix at Finn.no

Map calls User
What can possibly go wrong?
1. Connection refused => < 2 ms
2. Slow answer => 5 s
3. Very slow answer => timeout
4. The result causes an exception in the client library => depends

May kill both the server and the client

Page 15: How we sleep well at night using Hystrix at Finn.no

Map calls User
Let’s assume:
- Thread per request
- Response time - 4 s
- Map has 60 req/s. Fan-out to User is 2 => 120 req/s
- 240 / 480 threads blocking
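One way to read the 240 / 480 figure is as Little's law: threads in flight = arrival rate × time each request holds a thread. The sketch below is a hypothetical helper, not code from the talk; the class and method names are made up for illustration, and the interpretation of the two numbers (Map's own request threads versus calls in flight towards User) is an assumption.

```java
// Hypothetical helper, not from the slides: Little's law, L = lambda * W.
final class BlockedThreads {

    /** Threads held at steady state = requests per second x seconds each request holds a thread. */
    static long blocked(double requestsPerSecond, double responseTimeSeconds) {
        return Math.round(requestsPerSecond * responseTimeSeconds);
    }

    public static void main(String[] args) {
        double responseTime = 4.0; // seconds, as assumed on the slide
        int fanOut = 2;            // calls to User per Map request

        // Threads serving incoming Map requests while they wait.
        System.out.println(blocked(60, responseTime));          // prints 240
        // Calls in flight towards User (60 req/s x fan-out 2).
        System.out.println(blocked(60 * fanOut, responseTime)); // prints 480
    }
}
```

The same calculation with 130 req/s gives 520 and 1040, which is why a single slow dependency can exhaust a 500-thread container.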

Page 16: How we sleep well at night using Hystrix at Finn.no

mobilewebN has 130 req/s
Let’s assume:
- Thread per request
- RandomApp has 130 req/s. Fan-out to service is 2 => 260 req/s
- 520 / 1040 threads blocking

Page 17: How we sleep well at night using Hystrix at Finn.no

What happens in an app with 500 blocking threads?
Not much. Besides waiting. CPU is idle.
If maximum-threads == 500 => no more connections are allowed
And what about 1040 occupied threads?

Page 18: How we sleep well at night using Hystrix at Finn.no

And where is the user after 8 s?
At YouTube, Facebook or searching for cute kittens.

Page 19: How we sleep well at night using Hystrix at Finn.no

The problem we try to solve

An application with 30 dependent services - with 99.99% uptime for each service
99.99%^30 = 99.7% uptime
0.3% of 1 billion requests = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.
98%^30 ≈ 54.5% uptime
99.99% = 8 sec a day; 99.7% = 4 min per day
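The arithmetic on this slide can be checked with a few lines of Java. This is a minimal sketch, assuming the 30 dependencies fail independently and that every request needs all of them; `CompoundUptime` and its method names are hypothetical.

```java
// Hypothetical helper for checking the slide's numbers; assumes independent failures.
final class CompoundUptime {

    /** Uptime of an application that needs all of its dependencies to be up at once. */
    static double combined(double perServiceUptime, int services) {
        return Math.pow(perServiceUptime, services);
    }

    /** Expected downtime in seconds per 24-hour day for a given uptime fraction. */
    static double downtimeSecondsPerDay(double uptime) {
        return (1.0 - uptime) * 24 * 60 * 60;
    }

    public static void main(String[] args) {
        double excellent = combined(0.9999, 30); // roughly 0.997
        double mediocre = combined(0.98, 30);    // roughly 0.545

        System.out.printf("30 services at 99.99%%: %.2f%% uptime%n", excellent * 100);
        System.out.printf("30 services at 98%%: %.1f%% uptime%n", mediocre * 100);
        System.out.printf("Downtime per day at 99.99%%: %.1f s%n", downtimeSecondsPerDay(0.9999));
        System.out.printf("Downtime per day at combined uptime: %.1f min%n",
                downtimeSecondsPerDay(excellent) / 60);
    }
}
```

Over a 30-day month, 0.3% downtime is a little over two hours, which matches the "2+ hours" figure.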

Page 20: How we sleep well at night using Hystrix at Finn.no

Agenda
- Why?
- Tolerance for failure - How?
- How to create a Hystrix Command
- Monitoring and Dashboard
- Examples from finn
- One step further

Page 21: How we sleep well at night using Hystrix at Finn.no

What is Hystrix for?
- Control over latency and failure from dependencies
- Stop cascading failures in a complex distributed system.
- Fail fast and rapidly recover.
- Fallback and gracefully degrade when possible.
- Enable near real-time monitoring, alerting

Page 22: How we sleep well at night using Hystrix at Finn.no

Principles
- Fail fast - don’t let the user wait!
- Circuit breaker - don’t bother, it’s already down
- Fallback - can you give a sensible default, show stale data?
- Bulkhead - protect yourself against cascading failure
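To make the circuit breaker and fallback principles concrete, here is a toy sketch in plain Java. It is not how Hystrix implements its circuit breaker (Hystrix trips on an error percentage over a rolling window); this simplified version just counts consecutive failures. `ToyCircuitBreaker` and all of its members are hypothetical names.

```java
import java.util.function.Supplier;

// Toy circuit breaker for illustration only; not Hystrix's implementation.
final class ToyCircuitBreaker {
    private final int failureThreshold; // consecutive failures before the circuit opens
    private final long openMillis;      // how long to short-circuit before a trial call
    private int consecutiveFailures = 0;
    private boolean open = false;
    private long openedAt = 0L;

    ToyCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean isOpen() {
        return open;
    }

    /** Runs the risky call, or returns the fallback at once while the circuit is open. */
    <T> T call(Supplier<T> riskyCall, T fallback) {
        if (!allowRequest()) {
            return fallback; // "don't bother, it's already down"
        }
        try {
            T result = riskyCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback; // graceful degradation instead of an error
        }
    }

    private synchronized boolean allowRequest() {
        // Closed: always allow. Open: allow a trial call once the wait has passed.
        return !open || System.currentTimeMillis() - openedAt >= openMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = System.currentTimeMillis();
        }
    }

    public static void main(String[] args) {
        ToyCircuitBreaker breaker = new ToyCircuitBreaker(3, 5_000);
        for (int i = 0; i < 5; i++) {
            Integer altitude = breaker.<Integer>call(() -> {
                throw new IllegalStateException("service down");
            }, -1);
            System.out.println(altitude + " (circuit open: " + breaker.isOpen() + ")");
        }
    }
}
```

Once the threshold is reached the risky call is no longer attempted, so the caller gets the fallback immediately and the failing service is left alone until the trial call.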

Page 23: How we sleep well at night using Hystrix at Finn.no

How?

- Prevent any single dependency from using up all threads
- Shedding load and failing fast instead of queueing
- Providing fallbacks wherever feasible
- Using isolation techniques (such as bulkhead, swimlane, and circuit breaker patterns) to limit the impact of any one dependency.

Page 24: How we sleep well at night using Hystrix at Finn.no

Two different ways of isolation
Semaphore
- “at most 5 concurrent calls”
- only for CPU-intensive, local calls
Thread pool (dedicated couriers)
- the call to the underlying service is handled by a pool
- overhead is usually not problematic
- default approach
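The semaphore variant ("at most 5 concurrent calls") can be sketched with `java.util.concurrent.Semaphore`. This is a simplified illustration of the idea, not Hystrix's code; `SemaphoreBulkhead` is a hypothetical name. The work runs on the caller's thread, and calls over the limit are rejected at once instead of being queued.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Illustrative bulkhead using a semaphore; the class name is hypothetical.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /** Runs work on the calling thread if a permit is free, otherwise returns the fallback. */
    <T> T call(Supplier<T> work, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // shed load: reject immediately, never wait for a permit
        }
        try {
            return work.get();
        } finally {
            permits.release(); // always hand the permit back, even if work throws
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(5);
        Integer result = bulkhead.<Integer>call(() -> 6 * 7, -1);
        System.out.println(result); // prints 42
    }
}
```

Because the caller's own thread does the work, nothing here can cut a slow call short, which is one reason the thread pool approach suits network calls better.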

Page 25: How we sleep well at night using Hystrix at Finn.no

Recommended book: Release it!

Page 26: How we sleep well at night using Hystrix at Finn.no

Dependencies
Depends on:
- rxjava
- archaius (& commons-configuration)

FINN uses Constretto for configuration management, hence:

https://github.com/finn-no/archaius-constretto

Page 27: How we sleep well at night using Hystrix at Finn.no

Dependencies
There are useful add-ons:

hystrix-metrics-event-stream - json/http stream

hystrix-codahale-metrics-publisher (currently io.dropwizard.metrics)

(Follows the recent trend of really splitting up the dependencies - include only what you need)

Page 28: How we sleep well at night using Hystrix at Finn.no

Default properties
Quite sensible, “fail fast”
Do your own calculations of
- number of concurrent requests
- timeouts (99.8th percentile)
...by looking at your current performance (latency) per request and add a little buffer

Page 29: How we sleep well at night using Hystrix at Finn.no

threads = requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room
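The sizing rule on this slide can be turned into a small calculation. A minimal sketch, assuming latency is measured in milliseconds and the breathing room is a fixed number of extra threads; `PoolSizing` and the sample numbers are illustrative, not values from FINN.

```java
// Hypothetical sizing helper; the sample inputs are made up for illustration.
final class PoolSizing {

    /**
     * Peak requests per second x 99th percentile latency, rounded up to whole
     * threads, plus a fixed number of spare threads as breathing room.
     */
    static long threads(long peakRequestsPerSecond, long p99LatencyMillis, long breathingRoom) {
        long busyThreads = (peakRequestsPerSecond * p99LatencyMillis + 999) / 1000; // integer ceiling
        return busyThreads + breathingRoom;
    }

    public static void main(String[] args) {
        // 30 req/s at peak, 200 ms at the 99th percentile, 4 spare threads.
        System.out.println(threads(30, 200, 4)); // prints 10
    }
}
```

Measuring the inputs on a healthy system matters: sizing from latency during an incident would produce a pool large enough to hide the very problem the bulkhead should contain.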

Page 30: How we sleep well at night using Hystrix at Finn.no
Page 31: How we sleep well at night using Hystrix at Finn.no

Hystrix - part of NetflixOSS
Netflix OSS
- Hystrix - resilience
- Ribbon - remote calls
- Feign - REST client
- Eureka - Service discovery
- Archaius - Configuration
- Karyon - Starting point

Page 32: How we sleep well at night using Hystrix at Finn.no

Hystrix at FINN.no

Page 33: How we sleep well at night using Hystrix at Finn.no

Agenda
- Why?
- Tolerance for failure
- How to create a Hystrix Command
- Monitoring and Dashboard
- Examples from finn
- What did we learn

Page 34: How we sleep well at night using Hystrix at Finn.no

How to create a Hystrix Command
A command class wrapping the “risky” operation.
- must implement run()
- might implement getFallback()

Since version 1.4, an Observable implementation is also available.

Page 35: How we sleep well at night using Hystrix at Finn.no

AltitudeSearch - before

public int lookup(MapPoint p) {
    return altitude(p);
}

Page 36: How we sleep well at night using Hystrix at Finn.no

AltitudeSearch - after

public int lookup(MapPoint p) {
    return new LookupCommand(p).execute();
}

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    @Override
    protected Integer run() throws Exception {
        return altitude(p);
    }
}

Page 37: How we sleep well at night using Hystrix at Finn.no

FAQ
Does that mean I have to write a command for (almost) every remote operation in my application?

Page 38: How we sleep well at night using Hystrix at Finn.no

FAQ

YES!

Page 39: How we sleep well at night using Hystrix at Finn.no

Why is it so intrusive?

But Why?

Page 40: How we sleep well at night using Hystrix at Finn.no

Hystrix-Javanica

@HystrixCommand(fallbackMethod = "defaultUser",
        ignoreExceptions = {BadRequestException.class})
public User getUserById(String id) { }

private User defaultUser(String id) { }

Page 41: How we sleep well at night using Hystrix at Finn.no

Concurrency - The client decides

T = c.execute()                    synchronous
Future<T> = c.queue()              asynchronous
Observable<T> = c.observable()     reactive streams

Page 42: How we sleep well at night using Hystrix at Finn.no

Runtime behaviour

Page 56: How we sleep well at night using Hystrix at Finn.no

Agenda
- Why?
- Tolerance for failure
- How to create a Hystrix Command
- Metrics, Monitoring and Dashboard
- Examples from finn
- What did we learn

Page 57: How we sleep well at night using Hystrix at Finn.no

Metrics
- Circuit breaker open?
- Calls per second
- Execution time? Median, 90th, 95th and 99th percentile
- Status of thread pool?
- Number of clients in cluster

Page 58: How we sleep well at night using Hystrix at Finn.no

Publishing the metrics
- Servo - Netflix metrics library
- CodaHale/Yammer/dropwizard - metrics

HystrixPlugins.getInstance().registerMetricsPublisher(HystrixMetricsPublisher impl)

Page 59: How we sleep well at night using Hystrix at Finn.no

Dashboard toolset

hystrix-metrics-event-stream
- out of the box: servlet
- we use embedded jetty for thrift services

turbine-web
- aggregates metrics-event-stream into clusters

hystrix-dashboard
- graphical interface

Page 60: How we sleep well at night using Hystrix at Finn.no

Dashboard

Page 61: How we sleep well at night using Hystrix at Finn.no

More Details

Page 62: How we sleep well at night using Hystrix at Finn.no

Thread Pools

Page 63: How we sleep well at night using Hystrix at Finn.no

Details

Page 64: How we sleep well at night using Hystrix at Finn.no

Agenda
- Why?
- Tolerance for failure
- How to create a Hystrix Command
- Monitoring and Dashboard
- Examples from finn
- What did we learn

Page 65: How we sleep well at night using Hystrix at Finn.no

Examples from Finn - Code

- Altitudesearch
- Fetch Several Profiles using collapsing
- Operations

Page 66: How we sleep well at night using Hystrix at Finn.no

AltitudeSearch

public int lookup(MapPoint p) {
    return new LookupCommand(p).execute();
}

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    @Override
    protected Integer run() throws Exception {
        return altitude(p);
    }

    @Override
    protected Integer getFallback() {
        return -1;
    }
}

Page 67: How we sleep well at night using Hystrix at Finn.no

Migrating a library
- Create commands
- Wrap commands with existing services
- Backwards compatible
- No flexibility

Page 68: How we sleep well at night using Hystrix at Finn.no

Examples from Finn - Code

- Fetch a map point
- Fetch Several Profiles using collapsing
- Operations

Page 69: How we sleep well at night using Hystrix at Finn.no

Request Collapsing

Fetch one profile takes 10ms

Lots of concurrent requests

Better to fetch multiple profiles

Page 70: How we sleep well at night using Hystrix at Finn.no

Request Collapsing - why

decouples client model from server interface

reduces network overhead

client container/thread batches requests

Page 71: How we sleep well at night using Hystrix at Finn.no
Page 72: How we sleep well at night using Hystrix at Finn.no

Request Collapsing
Create two commands:
- Collapser: one new() per client request
- BatchCommand: one new() per server request

Page 73: How we sleep well at night using Hystrix at Finn.no

Request Collapsing
Integrate the two commands in two methods:
- createCommand(): create the BatchCommand from a list of single commands
- mapResponseToRequests(): map the list response to single responses

Page 74: How we sleep well at night using Hystrix at Finn.no

Create Collapser

public Collapser(Query query) { this.query = query;

Page 75: How we sleep well at night using Hystrix at Finn.no

Create BatchCommand

return new BatchCommand(collapsedRequests, client);

Page 76: How we sleep well at night using Hystrix at Finn.no

create BatchCommand

@Override
protected HystrixCommand<Map<Query,Profile>> createCommand(Collection<Request> collapsedRequests) {
    return new BatchCommand(collapsedRequests, client);
}

Page 77: How we sleep well at night using Hystrix at Finn.no

mapResponseToRequests

@Override
protected void mapResponseToRequests(Map<Query,Profile> batchResponse, Collection<Request> collapsedRequests) {
    // Requests missing from the batch response get a default profile.
    collapsedRequests.stream().forEach(c ->
        c.setResponse(batchResponse.getOrDefault(c.getArgument(), new ImmutableProfile(id))));
}

Page 79: How we sleep well at night using Hystrix at Finn.no

mapResponseToRequests

@Override
protected void mapResponseToRequests(Map<Query,Profile> batchResponse, Collection<Request> collapsedRequests) {
    // Requests missing from the batch response get a default profile.
    collapsedRequests.stream().forEach(c ->
        c.setResponse(batchResponse.getOrDefault(c.getArgument(), new ImmutableProfile(id))));
}

Graceful degradation

Page 80: How we sleep well at night using Hystrix at Finn.no

Request Collapsing - experiences
- Each individual request will be slower for the client, is that ok?
- 10 ms operation into 100 ms window
- Max 110 ms for client
- Average 60 ms
- Read documentation first!!
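The 110 ms and 60 ms figures follow from simple arithmetic. A minimal sketch, assuming requests arrive evenly spread across the collapsing window and the batched call takes as long as a single call; `CollapseLatency` is a hypothetical name.

```java
// Hypothetical helper for the latency cost of request collapsing.
final class CollapseLatency {

    /** A request arriving right at the start of the window waits for the whole window. */
    static double worstCaseMillis(double windowMillis, double batchCallMillis) {
        return windowMillis + batchCallMillis;
    }

    /** With evenly spread arrivals, a request waits half the window on average. */
    static double averageMillis(double windowMillis, double batchCallMillis) {
        return windowMillis / 2.0 + batchCallMillis;
    }

    public static void main(String[] args) {
        System.out.println(worstCaseMillis(100, 10)); // prints 110.0
        System.out.println(averageMillis(100, 10));   // prints 60.0
    }
}
```

A shorter window lowers the added latency but also batches fewer requests per server call, so the window size is a trade-off between client latency and server load.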

Page 81: How we sleep well at night using Hystrix at Finn.no

Examples from Finn - Code

- Fetch a map point
- Fetch Several Profiles using collapsing
- Operations

Page 82: How we sleep well at night using Hystrix at Finn.no

Example from Finn - Operations

[2015-06-31T13:37:00,485][ERROR] Forwarding to error page from request due to exception [AdCommand short-circuited and no fallback available.]
com.netflix.hystrix.exception.HystrixRuntimeException: RecommendMoreLikeThisCommand short-circuited and no fallback available.
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:811)

Page 83: How we sleep well at night using Hystrix at Finn.no

Error happens in production
- Operations gets paged with lots of error messages in logs
- They read the logs
- Lots of [ERROR]
- They restart the application

Page 84: How we sleep well at night using Hystrix at Finn.no

Learnings - operations
- Error messages mean different things with Hystrix
- What they say, not where they occur
- Built-in error recovery with circuit breaker
- Operations reads logs, not the Hystrix dashboard
- Lots of unnecessary restarts

Page 85: How we sleep well at night using Hystrix at Finn.no

Conclusions

What did we learn

Page 86: How we sleep well at night using Hystrix at Finn.no

Experiences from Finn

Hystrix belongs client-side

Page 87: How we sleep well at night using Hystrix at Finn.no

Experiences from Finn

Nested Hystrix commands are ok

Page 88: How we sleep well at night using Hystrix at Finn.no

Experiences from Finn

Graceful degradation is a big change in mindset

Little use of proper fallback-values

Page 89: How we sleep well at night using Hystrix at Finn.no

Experiences from Finn

Tried putting hystrix in low-level http client without great success.

Page 90: How we sleep well at night using Hystrix at Finn.no

Experiences from Finn

Server side errors are detected clientside

Page 91: How we sleep well at night using Hystrix at Finn.no

Experiences from Finn

Not all exceptions are errors.

Page 92: How we sleep well at night using Hystrix at Finn.no

Experiences from Finn

RxJava needs a full rewrite… Still useful without!

Page 93: How we sleep well at night using Hystrix at Finn.no

Experiences from FINN
Hystrix standardises things we did before:
- Nitty gritty http-client stuff
- Timeouts
- Connection pools
- Tuning thread pools
- Dashboards
- Metrics

Page 94: How we sleep well at night using Hystrix at Finn.no

Wrap up
Should you start using Hystrix?
- Bulkhead and circuit-breaker - explicit timeout and error handling is useful
- Dashboards

Further reading
- Ben Christensen, GOTO Aarhus 2013 - https://www.youtube.com/watch?v=_t06LRX0DV0
- Updated for QConSF2014: https://qconsf.com/system/files/presentation-slides/ReactiveProgrammingWithRx-QConSF-2014.pdf

Thanks for listening!

Page 95: How we sleep well at night using Hystrix at Finn.no
Page 96: How we sleep well at night using Hystrix at Finn.no

Questions?