how we sleep well at night using hystrix at finn.no
TRANSCRIPT
Hystrix - What did we learn?
JavaZone September 2015
Hystrix cristata
Audun Fauchald Strand & Henning Spjelkavik
public int lookup(MapPoint p) {
    return altitude(p);
}
Example
public int lookup(MapPoint p) {
    return new LookupCommand(p).execute();
}

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    protected Integer run() throws Exception {
        return altitude(p);
    }

    protected Integer getFallback() {
        return -1;
    }
}
Example
Audun Fauchald Strand @audunstrand
Henning Spjelkavik @spjelkavik
Agenda
- Why?
- Tolerance for failure - How?
- How to create a Hystrix Command
- Monitoring and Dashboard
- Examples from finn
- What did we learn
Service A calls Service B
Map calls User over the network
What can possibly go wrong?
1. Connection refused => < 2 ms
2. Slow answer => 5 s
3. Very slow answer (= never) => timeout
4. The result causes an exception in the client library => depends
Fails quickly
May kill both the server and the client
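The difference between a slow answer and a fast failure can be sketched with plain `java.util.concurrent`: the caller waits at most a fixed time and then returns a default. This is only an illustration of an explicit timeout, not how Hystrix is implemented; `TimeoutCall`, `callWithTimeout` and the values used are hypothetical names and numbers chosen for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: bounds the time a caller waits for a remote call.
final class TimeoutCall {

    /** Runs the task on the pool and returns the fallback on timeout or failure. */
    static <T> T callWithTimeout(ExecutorService pool, Callable<T> task,
                                 long timeoutMillis, T fallback) {
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);                 // interrupt the worker, stop waiting
            return fallback;
        } catch (ExecutionException e) {
            return fallback;                     // the call itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt flag
            future.cancel(true);
            return fallback;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // Simulated slow dependency: answers after 5 s, caller gives up after 100 ms.
            Integer altitude = callWithTimeout(pool, () -> {
                Thread.sleep(5_000);
                return 120;
            }, 100, -1);
            System.out.println("altitude = " + altitude);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caller's thread is released after the timeout even if the dependency never answers.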
Map calls User
Let’s assume:
Thread per request
Response time - 4 s
Map has 60 req/s. Fan-out to User is 2 => 120 req/s
240 / 480 threads blocking
mobilewebN has 130 req/s
Let’s assume:
Thread per request
RandomApp has 130 req/s. Fan-out to service is 2 => 260 req/s
520 / 1040 threads blocking
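The thread counts above follow from multiplying request rate by response time (Little's law). A minimal sketch of that arithmetic, assuming one blocking thread per request and the 4 s response time from the slide; reading "240 / 480" as the incoming rate versus the fanned-out rate is an interpretation, and `BlockedThreads` is a hypothetical name.

```java
// Sketch of the slide arithmetic: blocked threads = request rate x response time.
final class BlockedThreads {

    /** Threads occupied at steady state when every request blocks one thread. */
    static long blocked(double requestsPerSecond, double responseTimeSeconds) {
        return Math.round(requestsPerSecond * responseTimeSeconds);
    }

    public static void main(String[] args) {
        double responseTime = 4.0; // seconds, as assumed on the slide
        int fanOut = 2;            // downstream calls per incoming request
        for (double rate : new double[] {60, 130}) {
            System.out.printf("%.0f req/s: %d threads at the incoming rate, %d at the fanned-out rate%n",
                    rate, blocked(rate, responseTime), blocked(rate * fanOut, responseTime));
        }
    }
}
```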
What happens in an app with 500 blocking threads?
Not much. Besides waiting. CPU is idle.
If maximum-threads == 500 => no more connections are allowed
And what about 1040 occupied threads?
And where is the user after 8 s?
At Youtube, Facebook or searching for cute kittens.
The problem we try to solve
An application with 30 dependent services, each with 99.99% uptime:
99.99%^30 = 99.7% uptime
0.3% of 1 billion requests = 3,000,000 failures
2+ hours downtime/month even if all dependencies have excellent uptime.
98%^30 = 54% uptime
99.99% uptime = 8 s of downtime a day; 99.7% = 4 min per day
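The uptime figures can be reproduced with a few lines of arithmetic. This sketch assumes the dependencies fail independently and that a request needs all of them; `CompoundUptime` and its method names are made up for the example.

```java
// Sketch: availability of a request that needs every one of n dependencies.
final class CompoundUptime {

    /** Combined uptime when all dependencies must be up and fail independently. */
    static double combined(double perServiceUptime, int dependencies) {
        return Math.pow(perServiceUptime, dependencies);
    }

    /** Expected downtime in seconds over a period, for an uptime fraction in [0, 1]. */
    static double downtimeSeconds(double uptime, double periodSeconds) {
        return (1.0 - uptime) * periodSeconds;
    }

    public static void main(String[] args) {
        double day = 24 * 60 * 60;
        double excellent = combined(0.9999, 30); // roughly 0.997
        double mediocre = combined(0.98, 30);    // roughly 0.545
        System.out.printf("30 deps at 99.99%%: %.2f%% uptime, %.0f s down per day%n",
                excellent * 100, downtimeSeconds(excellent, day));
        System.out.printf("30 deps at 98%%: %.1f%% uptime%n", mediocre * 100);
        System.out.printf("one dep at 99.99%%: %.1f s down per day%n",
                downtimeSeconds(0.9999, day));
    }
}
```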
Control over latency and failure from dependencies
Stop cascading failures in a complex distributed system.
Fail fast and rapidly recover.
Fallback and gracefully degrade when possible.
Enable near real-time monitoring, alerting
What is Hystrix for?
Fail fast - don’t let the user wait!
Circuit breaker - don’t bother, it’s already down
Fallback - can you give a sensible default, show stale data?
Bulkhead - protect yourself against cascading failure
Principles
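To make the circuit breaker and fallback principles concrete, here is a deliberately simplified sketch in plain Java. It opens after a number of consecutive failures and lets a trial call through once a sleep window has passed. Hystrix itself decides based on error percentage over a rolling window, so this is not its algorithm; `SimpleCircuitBreaker` and all its parameters are hypothetical.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified circuit breaker: NOT Hystrix's implementation, only the idea.
final class SimpleCircuitBreaker {
    private final int failureThreshold;   // consecutive failures before opening
    private final long sleepWindowMillis; // how long to short-circuit before a trial call
    private final LongSupplier clock;     // injected so the sketch is testable
    private int consecutiveFailures;
    private long openedAt;
    private boolean open;

    SimpleCircuitBreaker(int failureThreshold, long sleepWindowMillis, LongSupplier clock) {
        this.failureThreshold = failureThreshold;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
    }

    /** Runs the action, or returns the fallback when the circuit is open or the action fails. */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // don't bother, it's already down
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    private synchronized boolean allowRequest() {
        return !open || clock.getAsLong() - openedAt >= sleepWindowMillis;
    }

    private synchronized void onSuccess() {
        consecutiveFailures = 0;
        open = false;
    }

    private synchronized void onFailure() {
        consecutiveFailures++;
        if (open || consecutiveFailures >= failureThreshold) {
            open = true;
            openedAt = clock.getAsLong(); // (re)start the sleep window
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(3, 5_000, System::currentTimeMillis);
        for (int i = 0; i < 5; i++) {
            Integer altitude = breaker.call(() -> {
                throw new IllegalStateException("service down");
            }, () -> -1);
            System.out.println("altitude = " + altitude + ", open = " + breaker.isOpen());
        }
    }
}
```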
How?
Avoid any single dependency from using up all threads
Shedding load and failing fast instead of queueing
Providing fallbacks wherever feasible
Using isolation techniques (such as bulkhead, swimlane, and circuit breaker patterns) to limit the impact of any one dependency.
Two different ways of isolation
Semaphore
- “at most 5 concurrent calls”
- only for CPU-intensive, local calls
Thread pool (dedicated couriers)
- the call to the underlying service is handled by a pool
- overhead is usually not problematic
- default approach
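The semaphore variant ("at most 5 concurrent calls") can be sketched with `java.util.concurrent.Semaphore`. The call runs on the caller's own thread; when no permit is free the load is shed immediately instead of queueing. `SemaphoreBulkhead` is a hypothetical class for illustration and not Hystrix's own code.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Sketch of semaphore isolation: cap concurrent calls, reject the rest at once.
final class SemaphoreBulkhead {
    private final Semaphore permits;

    SemaphoreBulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    /**
     * Runs the action if a permit is free, otherwise returns the fallback.
     * Exceptions thrown by the action propagate to the caller in this sketch.
     */
    <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (!permits.tryAcquire()) {
            return fallback.get(); // shed load instead of waiting
        }
        try {
            return action.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        SemaphoreBulkhead bulkhead = new SemaphoreBulkhead(5);
        String result = bulkhead.call(() -> "computed locally", () -> "rejected");
        System.out.println(result);
    }
}
```

Unlike thread pool isolation, a semaphore cannot time out a call that is already running, which fits the slide's advice to use it only for local calls.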
Recommended book: Release it!
Dependencies
Depends on
- rxjava
- archaius (& commons-configuration)
FINN uses Constretto for configuration management, hence:
https://github.com/finn-no/archaius-constretto
Dependencies
There are useful addons:
- hystrix-metrics-event-stream - json/http stream
- hystrix-codahale-metrics-publisher (currently io.dropwizard.metrics)
(Follows the recent trend of really splitting up the dependencies - include only what you need)
Default properties
Quite sensible, “fail fast”
Do your own calculations of
- number of concurrent requests
- timeouts (99.8 percentile)
...by looking at your current performance (latency) per request and adding a little buffer
threads = requests per second at peak when healthy × 99th percentile latency in seconds + some breathing room
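The sizing rule above as a small calculation. The input numbers are invented for illustration and should be replaced with measurements from your own service; `PoolSizing` and `threadPoolSize` are hypothetical names.

```java
// Sketch of the sizing rule: peak req/s x p99 latency (s) + breathing room.
final class PoolSizing {

    /** Suggested thread pool size; the product is rounded up before adding the buffer. */
    static int threadPoolSize(double peakRequestsPerSecond, double p99LatencySeconds,
                              int breathingRoom) {
        return (int) Math.ceil(peakRequestsPerSecond * p99LatencySeconds) + breathingRoom;
    }

    public static void main(String[] args) {
        // Illustrative values only: 30 req/s at peak, p99 latency of 250 ms, 4 spare threads.
        int size = threadPoolSize(30, 0.25, 4);
        System.out.println("suggested pool size: " + size);
    }
}
```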
Hystrix - part of NetflixOSS
Netflix OSS
- Hystrix - resilience
- Ribbon - remote calls
- Feign - Rest client
- Eureka - Service discovery
- Archaius - Configuration
- Karyon - Starting point
Hystrix at FINN.no
How to create a Hystrix Command
A command class wrapping the “risky” operation.
- must implement run()
- might implement getFallback()
Since version 1.4, an Observable implementation is also available
public int lookup(MapPoint p) {
    return altitude(p);
}
AltitudeSearch - before
public int lookup(MapPoint p) {
    return new LookupCommand(p).execute();
}

private class LookupCommand extends HystrixCommand<Integer> {
    final MapPoint p;

    LookupCommand(MapPoint p) {
        super(HystrixCommandGroupKey.Factory.asKey("NorkartAltitude"));
        this.p = p;
    }

    protected Integer run() throws Exception {
        return altitude(p);
    }
}
AltitudeSearch - after
FAQ
Does that mean I have to write a command for (almost) every remote operation in my application?
FAQ
YES!
Why is it so intrusive?
But Why?
Hystrix-Javanica
@HystrixCommand(fallbackMethod = "defaultUser",
                ignoreExceptions = {BadRequestException.class})
public User getUserById(String id) { }

private User defaultUser(String id) { }
Concurrency - The client decides
T = c.execute() synchronous
Future<T> = c.queue() asynchronous
Observable<T> = c.observe() reactive streams
Runtime behaviour
Metrics
- Circuit breaker open?
- Calls per second
- Execution time? Median, 90th, 95th and 99th percentile
- Status of thread pool?
- Number of clients in cluster
Publishing the metrics
- Servo - Netflix metrics library
- CodaHale/Yammer/dropwizard - metrics
HystrixPlugins.getInstance().registerMetricsPublisher(HystrixMetricsPublisher impl)
Dashboard toolset
hystrix-metrics-event-stream
- out of the box: servlet
- we use embedded jetty for thrift services
turbine-web
- aggregates metrics-event-stream into clusters
hystrix-dashboard
- graphical interface
Dashboard
More Details
Thread Pools
Details
Examples from Finn - Code
- Altitudesearch
- Fetch Several Profiles using collapsing
- Operations
AltitudeSearch
Migrating a library
- Create commands
- Wrap commands with existing services
- Backwards compatible
- No flexibility
Request Collapsing
Fetch one profile takes 10ms
Lots of concurrent requests
Better to fetch multiple profiles
Request Collapsing - why
decouples client model from server interface
reduces network overhead
client container/thread batches requests
Request Collapsing
Create two commands
- Collapser: one new() per client request
- BatchCommand: one new() per server request
Request Collapsing
Integrate two commands in two methods
- createCommand(): create a BatchCommand from a list of single commands
- mapResponseToRequests(): map the list response to single responses
Create Collapser
public Collapser(Query query) {
    this.query = query;
}
Create BatchCommand
@Override
protected HystrixCommand<Map<Query,Profile>> createCommand(Collection<Request> collapsedRequests) {
    return new BatchCommand(collapsedRequests, client);
}
mapResponseToRequests
@Override
protected void mapResponseToRequests(
        Map<Query,Profile> batchResponse, Collection<Request> collapsedRequests) {
    // A request with no entry in the batch response gets a default profile
    collapsedRequests.forEach(c -> c.setResponse(
            batchResponse.getOrDefault(c.getArgument(), new ImmutableProfile(id))));
}
Graceful degradation
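The collapsing idea can also be shown without Hystrix: several pending single-item requests are answered by one batch call, and each request receives its own slice of the response, with a default when the batch has no entry for it. This is a plain Java sketch; `CollapseSketch`, `Pending` and the string-based profile are hypothetical stand-ins for the slide's Collapser, Request and Profile types.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Plain Java illustration of request collapsing; not the Hystrix API.
final class CollapseSketch {

    /** One single-item request waiting for its part of the batch response. */
    static final class Pending {
        final String query;
        String response;

        Pending(String query) {
            this.query = query;
        }
    }

    /** Issues one batch call for all pending requests and maps the result back to each. */
    static void resolve(List<Pending> pending,
                        Function<Set<String>, Map<String, String>> batchCall,
                        String defaultProfile) {
        Set<String> queries = new LinkedHashSet<>();
        for (Pending p : pending) {
            queries.add(p.query); // duplicates collapse into one lookup
        }
        Map<String, String> batchResponse = batchCall.apply(queries); // one server request
        for (Pending p : pending) {
            // Graceful degradation: missing entries fall back to the default profile.
            p.response = batchResponse.getOrDefault(p.query, defaultProfile);
        }
    }

    public static void main(String[] args) {
        List<Pending> pending = Arrays.asList(new Pending("alice"), new Pending("bob"));
        resolve(pending, queries -> {
            Map<String, String> found = new HashMap<>();
            found.put("alice", "profile of alice"); // "bob" is missing on purpose
            return found;
        }, "empty profile");
        for (Pending p : pending) {
            System.out.println(p.query + " -> " + p.response);
        }
    }
}
```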
Request Collapsing - experiences
Each individual request will be slower for the client, is that ok?
- 10 ms operation into 100 ms window
- Max 110 ms for client
- Average 60 ms (on average half the window, plus the 10 ms operation)
Read documentation first!!
Example from Finn - Operations
[2015-06-31T13:37:00,485][ERROR] Forwarding to error page from request due to exception [AdCommand short-circuited and no fallback available.]
com.netflix.hystrix.exception.HystrixRuntimeException: RecommendMoreLikeThisCommand short-circuited and no fallback available.
    at com.netflix.hystrix.AbstractCommand$16.call(AbstractCommand.java:811)
Error happens in production
- Operations gets paged with lots of error messages in logs
- They read the logs
- Lots of [ERROR]
- They restart the application
Learnings - operations
- Error messages mean different things with Hystrix
- What they say, not where they occur
- Built-in error recovery with circuit breaker
- Operations reads logs, not the Hystrix dashboard
- Lots of unnecessary restarts
Conclusions
What did we learn
Experiences from Finn
- Hystrix belongs client-side
- Nested Hystrix commands are ok
- Graceful degradation is a big change in mindset
- Little use of proper fallback values
- Tried putting Hystrix in a low-level http client without great success
- Server-side errors are detected client-side
- Not all exceptions are errors
- RxJava needs a full rewrite… Still useful without!
Experiences from FINN
Hystrix standardises things we did before:
- Nitty gritty http-client stuff
- Timeouts
- Connection pools
- Tuning thread pools
- Dashboards
- Metrics
Wrap up
Should you start using Hystrix?
- Bulkhead and circuit-breaker - explicit timeout and error handling is useful
- Dashboards
Further reading
Ben Christensen, GOTO Aarhus 2013 - https://www.youtube.com/watch?v=_t06LRX0DV0
Updated for QConSF2014: https://qconsf.com/system/files/presentation-slides/ReactiveProgrammingWithRx-QConSF-2014.pdf
Thanks for listening!
Questions?