everybody lies

Post on 21-Jul-2015


  • E V E R Y B O D Y L I E S

    T O M A S Z K O W A L C Z E W S K I

  • C A R G O C U L T

    During the Middle Ages there were all kinds of crazy ideas, such as that a piece of rhinoceros horn would increase potency. Then a method was discovered for separating the ideas, which was to try one to see if it worked, and if it didn't work, to eliminate it. This method became organized, of course, into science. And it developed very well, so that we are now in the scientific age. It is such a scientific age, in fact, that we have difficulty in understanding how witch doctors could ever have existed, when nothing that they proposed ever really worked, or very little of it did.

    Richard Feynman

    From a Caltech commencement address given in 1974

  • W H Y B O T H E R ?

    You get what you measure

    - Ineffective optimisations that complicate code

    + Numbers to convince management to do refactoring or migration to Java 8!

  • W H Y B O T H E R ?

    Predictable is better than fast

    Displaying one page requires multiple calls (static and dynamic resources)

    Multiple microservices are called to generate a response

    During a session a user may display your web pages hundreds of times

  • W H Y D O T H I S ?

    Every 100 ms increase in load time of Amazon.com decreased sales by 1%. [1]

    Increasing web search latency 100 to 400 ms reduces the daily searches per user by 0.2% to 0.6%. Furthermore, users do fewer searches the longer they are exposed. For longer delays, the loss of searches persists for a time even after latency returns to previous levels. [2]

    [1] Kohavi and Longbotham 2007

    [2] Brutlag 2009

  • S U R V E Y

    Do you

    Use graphite?

    Feed it with Coda Hale/Dropwizard metrics?

    Modify their source? Use nonstandard options?

    Graph average? Median?

    Percentiles?

  • (c) xkcd.com, http://xkcd.com
  • W H A T M E T R I C S C A N W E U S E ?

    graphite.send(prefix(name, "max"), ...);
    graphite.send(prefix(name, "mean"), ...);
    graphite.send(prefix(name, "min"), ...);
    graphite.send(prefix(name, "stddev"), ...);
    graphite.send(prefix(name, "p50"), ...);
    graphite.send(prefix(name, "p75"), ...);
    graphite.send(prefix(name, "p95"), ...);
    graphite.send(prefix(name, "p98"), ...);
    graphite.send(prefix(name, "p99"), ...);
    graphite.send(prefix(name, "p999"), ...);

  • D O N ' T L O O K A T M E A N

    1000 queries - 0ms latency, 100 queries - 5s latency

    Average is ~454.5ms

    1000 queries - 1ms latency, 100 queries - 5s latency

    Average is ~455.5ms

    The mean is nearly identical in both cases and does not help to quantify the lag users will experience
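
    The arithmetic behind the two cases on this slide can be checked with a short sketch. The class and method names below are made up for illustration; the query counts and latencies are the ones from the slide.

```java
class MeanLatency {
    /** Mean latency in ms of two groups of queries (hypothetical helper). */
    static double mean(long countA, double latencyA, long countB, double latencyB) {
        return (countA * latencyA + countB * latencyB) / (countA + countB);
    }

    /** Fraction of all queries that fall into the slow group. */
    static double slowShare(long fastCount, long slowCount) {
        return (double) slowCount / (fastCount + slowCount);
    }

    public static void main(String[] args) {
        // 1000 queries at 0 ms plus 100 queries at 5 s (5000 ms)
        System.out.printf("case 1 mean: %.1f ms%n", mean(1000, 0.0, 100, 5000.0));
        // 1000 queries at 1 ms plus 100 queries at 5 s
        System.out.printf("case 2 mean: %.1f ms%n", mean(1000, 1.0, 100, 5000.0));
        // In both cases the same share of queries waits 5 s
        System.out.printf("slow share: %.1f%%%n", 100 * slowShare(1000, 100));
    }
}
```

    Both means land near 455 ms, although no query took anywhere near that long: about 91% finished in at most 1 ms and about 9% took 5 s.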

  • A N S C O M B E ' S Q U A R T E T B Y F R A N C I S A N S C O M B E

    These four data sets all have nearly identical mean, variance, correlation and linear regression line, yet look completely different when plotted

  • P L O T T I N G M E A N I S F O R S H O W I N G O F F T O M A N A G E M E N T

  • M A Y B E M E D I A N T H E N ?

    What is the probability of an end user encountering latency worse than the median?

    Remember: usually multiple requests are needed to respond to an API call (e.g. N microservices, N resource requests per page)

    Probability that all N requests are better than the median: $\left(\frac{1}{2}\right)^{N} \cdot 100\%$

  • P R O B A B I L I T Y O F E X P E R I E N C I N G L A T E N C Y B E T T E R T H A N M E D I A N

    I N F U N C T I O N O F M I C R O S E R V I C E S I N V O L V E D

    [Plot: probability in % (y axis, up to 100) against number of microservices involved (x axis, 0 to 10)]

  • W H I C H P E R C E N T I L E I S R E L E V A N T T O Y O U ?

    Is the 99th percentile a demanding constraint?

    In an application serving 1000 qps, latency worse than that happens ten times per second.

    A user who needs to navigate through several web pages will most probably experience it

    What is the probability of encountering latency better than the 99th percentile?

    $\left(\frac{99}{100}\right)^{N} \cdot 100\%$

  • P R O B A B I L I T Y O F E X P E R I E N C I N G L A T E N C Y B E T T E R T H A N 9 9 T H P E R C E N T I L E

    I N F U N C T I O N O F M I C R O S E R V I C E S I N V O L V E D

    [Plot: probability in % (y axis, 0 to 100) against number of microservices involved (x axis, 0 to 100)]
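
    The curves on the last few slides follow from one formula: if a single request beats a given percentile with probability p/100, then N requests all beat it with probability (p/100)^N. A minimal sketch, assuming the requests are independent of each other (an assumption that real systems only approximate); the class and method names are hypothetical.

```java
class PercentileOdds {
    /**
     * Chance (0..1) that all n independent requests are faster than the
     * given percentile, e.g. 50 for the median or 99 for the 99th percentile.
     */
    static double allBetterThan(double percentile, int n) {
        return Math.pow(percentile / 100.0, n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 2, 5, 10, 50, 100}) {
            System.out.printf("N=%3d  all better than median: %7.3f%%  all better than p99: %7.3f%%%n",
                    n, 100 * allBetterThan(50, n), 100 * allBetterThan(99, n));
        }
    }
}
```

    With N = 10 the chance that every call beats the median is below 0.1%; with N = 100 the chance that every call beats the 99th percentile is about 37%.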

  • D O N O T A V E R A G E P E R C E N T I L E S

    Example scenario:

    1. Load balancer splits traffic unevenly (ELB anyone?)

    2. Server S1 has 1 qps over measured time with 95%ile == 1ms

    3. Server S2 has 100 qps over measured time with 95%ile == 10s

    4. Average is ~5s.

    5. What does that tell us?

    6. Did we satisfy the SLA if it says the 95%ile must be below 8s?

    7. Actual 95%ile is ~10s
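
    The scenario above can be reproduced with made-up samples. This is only a sketch: the one-minute window, the 100 ms value for fast S2 requests and the nearest-rank percentile method are assumptions chosen so that S1 and S2 show the per-server 95%iles from the list; all names are hypothetical.

```java
import java.util.Arrays;

class AveragedPercentiles {
    /** Nearest-rank percentile of the given latencies (ms). */
    static long percentile(long[] values, double p) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** S1: 1 qps for 60 s, every request takes 1 ms. */
    static long[] s1Samples() {
        long[] s = new long[60];
        Arrays.fill(s, 1L);
        return s;
    }

    /** S2: 100 qps for 60 s, 90% of requests take 100 ms, 10% take 10 s. */
    static long[] s2Samples() {
        long[] s = new long[6000];
        Arrays.fill(s, 0, 5400, 100L);
        Arrays.fill(s, 5400, 6000, 10_000L);
        return s;
    }

    static long[] merge(long[] a, long[] b) {
        long[] all = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, all, a.length, b.length);
        return all;
    }

    public static void main(String[] args) {
        long p95s1 = percentile(s1Samples(), 95);
        long p95s2 = percentile(s2Samples(), 95);
        // Averaging the two per-server percentiles ignores how much traffic each served
        System.out.println("average of per-server 95%iles: " + (p95s1 + p95s2) / 2.0 + " ms");
        // The percentile over all requests is what users actually experienced
        System.out.println("95%ile over all requests: "
                + percentile(merge(s1Samples(), s2Samples()), 95) + " ms");
    }
}
```

    The averaged figure sits at about 5 s and would pass an 8 s SLA, while the percentile computed over all requests is 10 s.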

  • A L I C E ' S A D V E N T U R E S I N W O N D E R L A N D

    'If there's no meaning in it,' said the King, 'that saves a world of trouble, you know, as we needn't try to find any.'

  • Every time you average max values someone in the world starts a new JavaScript framework

  • Demo time

  • metricRegistry.timer("2015.standardTimer");

    The standard timer will over- or under-report actual percentiles at will.

    Green line represents actual MAX values.

  • T I M E R S H I S T O G R A M R E S E R V O I R

    Backing storage for Timer data

    Contains a statistically representative sample of a data stream

    Default is ExponentiallyDecayingReservoir, which has many drawbacks and is the source of most inaccuracies observed throughout this presentation

    Others include

    UniformReservoir, SlidingTimeWindowReservoir, SlidingWindowReservoir
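
    To see why a reservoir that keeps only a sample can miss rare extreme values, here is a minimal, self-contained sketch of plain uniform reservoir sampling (Algorithm R). It is not the Dropwizard code and it is not the decaying algorithm used by ExponentiallyDecayingReservoir; the class, the stream of values and the seeds are invented for illustration, and only the sample size of 1028 is borrowed from the library default.

```java
import java.util.Random;

class SampledReservoir {
    private final long[] samples;
    private final Random random;
    private long seen = 0;

    SampledReservoir(int size, long seed) {
        this.samples = new long[size];
        this.random = new Random(seed);
    }

    /** Algorithm R: every value seen so far is kept with equal probability. */
    void update(long value) {
        seen++;
        if (seen <= samples.length) {
            samples[(int) (seen - 1)] = value;                 // still filling up
        } else {
            long slot = (long) (random.nextDouble() * seen);   // uniform in [0, seen)
            if (slot < samples.length) {
                samples[(int) slot] = value;                   // replace a random old sample
            }
        }
    }

    int size() {
        return (int) Math.min(seen, samples.length);
    }

    /** Largest value still present in the sample, not in the whole stream. */
    long max() {
        long max = Long.MIN_VALUE;
        for (int i = 0; i < size(); i++) {
            max = Math.max(max, samples[i]);
        }
        return max;
    }

    /** Feeds 100,000 values: 1 ms each, except every 10,000th, which takes 5000 ms. */
    static SampledReservoir demo(long seed) {
        SampledReservoir reservoir = new SampledReservoir(1028, seed);
        for (int i = 1; i <= 100_000; i++) {
            reservoir.update(i % 10_000 == 0 ? 5000 : 1);
        }
        return reservoir;
    }

    public static void main(String[] args) {
        for (long seed = 1; seed <= 5; seed++) {
            System.out.println("true max: 5000 ms, max in sample: " + demo(seed).max() + " ms");
        }
    }
}
```

    Each value in this stream has roughly a 1% chance of surviving in the sample, so most runs report a maximum of 1 ms even though ten requests took 5 s.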

  • E X P O N E N T I A L L Y D E C A Y I N G R E S E R V O I R

    Stores 1028 random samples by default

    Assumes normal distribution of recorded values

    Many statistical tools applied in computer systems monitoring will assume normal distribution

    Be suspicious of such tools

    Why is that a bad idea?

  • N O R M A L D I S T R I B U T I O N - W H Y S O U S E F U L ?

    Central limit theorem

    Chebyshev's inequality

    $f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

    [Plot: density curve, x axis from -2.4 to 2.4]

  • C A L C U L A T E 9 5 % I L E B A S E D O N M E A N A N D S T D . D E V .

    IFF latency values were distributed normally then we could calculate any percentile based on mean and standard deviation

    $\mu$ = 10ms, $\sigma$ = 1ms

    Lookup into standard normal (Z) table

    95%ile is located 1.65 std. dev. above the mean

    Result is 11.65ms

    [Plot: x axis from 10 to 12]

  • Latency profile resembling normal distribution

  • Add spikes due to young gen GC pauses

  • Add spikes due to old gen GC pauses

  • Add spikes due to calling other services (like DB)

  • Add spikes due to lost TCP packet retransmission, disk swapping, kernel bookkeeping, etc.

  • N O R M A L D I S T R I B U T I O N - W H Y N O T A P P L I C A B L E ?

    The value of the normal distribution is practically zero when the value x lies more than a few standard deviations away from the mean.

    It may not be an appropriate model when one expects a significant fraction of outliers

    [...] other statistical inference methods that are optimal for normally distributed variables often become highly unreliable when applied to such data. [1]

    $f(x, \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

    [Plot: density curve, x axis from -2.4 to 2.4]

    [1] All quotes on this slide from Wikipedia
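
    A small experiment shows how far a normal-distribution estimate can drift once a latency profile contains spikes. The sketch below uses invented samples (970 requests at 10 ms and 30 at 500 ms, standing in for pauses such as GC), a nearest-rank percentile, and the usual z-scores of 1.645 and 2.326 for the 95th and 99th percentile; all names are hypothetical.

```java
import java.util.Arrays;

class NormalAssumption {
    /** Percentile predicted from mean and std. dev. if the data were normal. */
    static double normalEstimate(double mean, double stdDev, double z) {
        return mean + z * stdDev;
    }

    /** Nearest-rank percentile taken from the samples themselves. */
    static double actualPercentile(double[] values, double p) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    static double mean(double[] values) {
        return Arrays.stream(values).average().orElse(Double.NaN);
    }

    /** Population standard deviation. */
    static double stdDev(double[] values) {
        double m = mean(values);
        double sumSq = 0;
        for (double v : values) {
            sumSq += (v - m) * (v - m);
        }
        return Math.sqrt(sumSq / values.length);
    }

    /** Made-up latency profile: mostly 10 ms, with 3% of requests hitting a 500 ms pause. */
    static double[] spikySamples() {
        double[] s = new double[1000];
        Arrays.fill(s, 0, 970, 10.0);
        Arrays.fill(s, 970, 1000, 500.0);
        return s;
    }

    public static void main(String[] args) {
        double[] s = spikySamples();
        double m = mean(s);
        double sd = stdDev(s);
        System.out.printf("95%%ile  normal estimate: %.1f ms, actual: %.1f ms%n",
                normalEstimate(m, sd, 1.645), actualPercentile(s, 95));
        System.out.printf("99%%ile  normal estimate: %.1f ms, actual: %.1f ms%n",
                normalEstimate(m, sd, 2.326), actualPercentile(s, 99));
    }
}
```

    For these samples the normal estimate puts the 95th percentile at roughly 162 ms where the data says 10 ms, and the 99th at roughly 219 ms where the data says 500 ms, so the same shortcut over-reports one percentile and under-reports the other.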

  • Blue line represents metric reported from Timer class. Green line represents request rate.

  • T I M E R , T I M E R N E V E R C H A N G E S

    Timer values decay exponentially

    giving artificial smoothing of values for server behaviour that may be long gone

    Timer that is not updated does not decay

    If Timer is not updated (e.g. subprocess failed and we stopped sending requests to it) its values will remain constant

    Check this post for potential solutions: http://taint.org/2014/01/16/145944a.html
  • H D R H I S T O G R A M

    Supports recording and analysis of sampled data across configurable range with configurable accuracy

    Provides compact representation of data while retaining high resolution

    Allows configurable tradeoffs between space and accuracy

    Very fast, allocation free, not thread safe for maximum speed (thread safe versions available)

    Created by Gil Tene of Azul Systems

  • R E C O R D E R

    Uses HdrHistogram to store values

    Supports concurrent recording of values

    Recording is lock free, and also wait free on most architectures (those that support lock xadd)

    Reading is not lock free but does not stall writers (writer-reader phaser)

    Check out Marshall Pierce's librar