How NOT to Measure Latency

Matt Schuetze, Product Management Director, Azul Systems (@azulsystems)

South Bay (LA) Java User Group, El Segundo, California, 4/19/2016

© Copyright Azul Systems 2015



Understanding Latency and Application Responsiveness


The Oh $@%T! talk.


About me: Matt Schuetze

Product Management Director at Azul Systems

Translate Voice of Customer into Zing and Zulu requirements and work items

Sing the praises of Azul efforts through product launches

Azul alternate on JCP exec committee, co-lead Detroit Java User Group

Stand on the shoulders of giants and admit it


Philosophy and motivation

What do we actually care about? And why?


Latency Behavior

Latency: The time it took one operation to happen

Each operation occurrence has its own latency

What we care about is how latency behaves

Behavior is a lot more than “the common case was X”


We like to look at charts

The “We only want to show good things” chart

95%’ile


What do you care about?

Do you:

Care about latency in your system?

Care about the worst case?

Care about the 99.99%’ile?

Only care about the fastest thing in the day?

Only care about the best 50%?

Only need 90% of operations to meet requirements?


We like to rant about latency


“outliers”, “averages” and other nonsense

99%’ile is ~60 usec (but mean is ~210 usec)

We nicknamed these spikes “hiccups”


Dispelling standard deviation


Mean = 0.06 msec, Std. Deviation (σ) = 0.21 msec

99.999%’ile = 38.66 msec

In a normal distribution, the 99.999%’ile falls within 4.5 σ

These are NOT normal distributions: the 99.999%’ile is ~184 σ (!!!) away from the mean
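The arithmetic behind the σ comparison can be checked with a short sketch. The three input figures are the ones on the slide; the variable names and the use of Python's `statistics.NormalDist` are illustrative assumptions, not part of the talk.

```python
from statistics import NormalDist

# Figures quoted on the slide (milliseconds).
mean_ms = 0.06
stddev_ms = 0.21
observed_p99999_ms = 38.66

# How many standard deviations the observed 99.999%'ile sits from the mean.
sigmas_away = (observed_p99999_ms - mean_ms) / stddev_ms

# Where the 99.999%'ile would sit if the data really were normal.
z_normal = NormalDist().inv_cdf(0.99999)
predicted_p99999_ms = mean_ms + z_normal * stddev_ms

print(f"observed 99.999%'ile is {sigmas_away:.0f} sigma from the mean")
print(f"a normal distribution would put it at {z_normal:.2f} sigma, "
      f"about {predicted_p99999_ms:.2f} msec")
```

If the distribution were normal, mean and standard deviation alone would pin the 99.999%'ile below one millisecond, which is why deriving tail behavior from σ is misleading here.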


Is the 99%’ile “rare”?


What are the chances of a single web page view experiencing the 99%’ile latency of:

- A single search engine node?

- A single Key/Value store node?

- A single Database node?

- A single CDN request?

Cumulative probability…


Which HTTP response time metric is more “representative” of user experience?

The 95%’ile or the 99.9%’ile?


Gauging user experience

Example: A typical user session involves 5 page loads, averaging 40 resources per page.

- How many of our users will NOT experience something worse than the 95%’ile?

Answer: ~0.003%

- How many of our users will experience at least one response that is longer than the 99.9%’ile?

Answer: ~18%
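Both answers follow from the 5 x 40 = 200 requests in a session. The sketch below reproduces them under the assumption (not stated on the slide) that each request's latency is independent of the others; the function names are made up for illustration.

```python
def fraction_never_worse(percentile, requests):
    """Share of sessions in which every request stays at or below the percentile."""
    return (percentile / 100.0) ** requests

def fraction_at_least_one_worse(percentile, requests):
    """Share of sessions that see at least one request above the percentile."""
    return 1.0 - fraction_never_worse(percentile, requests)

requests_per_session = 5 * 40  # 5 page loads, about 40 resources each

lucky = fraction_never_worse(95, requests_per_session)
unlucky = fraction_at_least_one_worse(99.9, requests_per_session)

print(f"never worse than the 95%'ile: {lucky:.4%}")
print(f"at least one worse than the 99.9%'ile: {unlucky:.1%}")
```

The same functions apply to the earlier question about a page view that touches several back-end nodes: the more requests a user makes, the more likely the user sees the tail.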


Classic look at response time behavior

Response time as a function of load

source: IBM CICS server documentation, “understanding response times”

Average? Max? Median? 90%? 99.9%?


Hiccups are strongly multi-modal

They don’t look anything like a normal distribution

A complete shift from one mode/behavior to another

Mode A: “good”

Mode B: “somewhat bad”

Mode C: “terrible”, ...

The real world is not a gentle, smooth curve

Mode transitions are “phase changes”


Proven ways to deal with hiccups

Actually characterizing latency

Requirements

Response Time Percentile plot line


Comparing Behavior

Different throughputs, configurations, or other parameters on one graph


Shameless Bragging


Comparing Behaviors - Actual Latency sensitive messaging distribution application: HotSpot vs. Zing


Zing

A standards-compliant JVM for Linux/x86 servers

Eliminates Garbage Collection as a concern for enterprise applications in Java, Scala, or any JVM language

Very wide operating range: Used in both low latency and large scale enterprise application spaces

Decouples scale metrics from response time concerns

Transaction rate, data set size, concurrent users, heap size, allocation rate, mutation rate, etc.

Leverages elastic memory for resilient operation


What is Zing good for?

If you have a server-based Java application

And you are running on Linux

And you are using more than ~300 MB of memory (perhaps 8-16 GB, on up to as high as 2 TB),

Then Zing will likely deliver superior behavior metrics


Where Zing shines

Low latency: Eliminate behavior blips down to the sub-millisecond-units level

Machine-to-machine “stuff”: Support higher *sustainable* throughput (one that meets SLAs)

Messaging, queues, market data feeds, fraud detection, analytics

Human response times: Eliminate user-annoying response time blips. Multi-second and even fraction-of-a-second blips will be completely gone.

Support larger memory JVMs *if needed* (e.g. larger virtual user counts, or larger cache, in-memory state, or consolidating multiple instances)

“Large” data and in-memory analytics: Make batch stuff “business real time”. Gain super-efficiencies.

Cassandra, Spark, Solr/Lucene, any large dataset in fast motion


An accidental conspiracy...

The coordinated omission problem


The coordinated omission problem

Common load testing example:

– each “client” issues requests at a certain rate

– measure/log response time for each request

So what’s wrong with that?

– works only if ALL responses fit within interval

– implicit “automatic back off” coordination

Begin audience participation exercise now…
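A small deterministic simulation can make the “automatic back off” concrete. This is a sketch under stated assumptions, not the exercise from the talk: the client intends to send one request every 10 ms, the server answers in 1 ms except for one 500 ms freeze, and all names and numbers are hypothetical. The uncorrected view times each request from when it was actually sent; the corrected view times it from when it should have been sent.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def simulate(duration_ms, interval_ms, service_ms, stall_start_ms, stall_ms):
    """One blocking client against a server that freezes once."""
    uncorrected, corrected = [], []
    stall_end_ms = stall_start_ms + stall_ms
    intended = 0  # when the request should have gone out
    now = 0       # when the previous response came back
    while intended < duration_ms:
        # A blocked client cannot send until the previous response arrives.
        send = max(now, intended)
        start = stall_end_ms if stall_start_ms <= send < stall_end_ms else send
        done = start + service_ms
        uncorrected.append(done - send)    # what a naive tool logs
        corrected.append(done - intended)  # what a waiting user experiences
        now = done
        intended += interval_ms
    return uncorrected, corrected

uncorrected, corrected = simulate(2000, 10, 1, 1000, 500)
for name, data in (("uncorrected", uncorrected), ("corrected", corrected)):
    slow = sum(1 for x in data if x > 1)
    print(name, "slow samples:", slow, "of", len(data),
          "99%'ile:", percentile(data, 99), "max:", max(data))
```

Both views agree on the max, but the uncorrected log contains a single slow sample, so every percentile short of the max looks perfect. Measuring from the intended send time restores the samples the stalled client silently skipped.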


Is coordinated omission rare?

It is MUCH more common than you may think...


JMeter makes this mistake... And so do other tools!

Before Correction

After Correcting for Omission


Real World Coordinated Omission effects

Before Correction

After Correction

Wrong by 7x


Uncorrected Data

Real World Coordinated Omission effects


Uncorrected Data

Corrected for Coordinated Omission

Real World Coordinated Omission effects


Real World Coordinated Omission effects: Why I care

A ~2500x difference in reported percentile levels for the problem that Zing eliminates

Zing

“other” JVM


Response Time vs. Service Time


Service Time vs. Response Time


Coordinated Omission usually makes something that you think is a Response Time metric only represent the Service Time component


Response Time vs. Service Time @2K/sec


Response Time vs. Service Time @20K/sec


Response Time vs. Service Time @60K/sec


Response Time vs. Service Time @80K/sec


Response Time vs. Service Time @90K/sec


How “real” people react


Suggestions

Whatever your measurement technique is, test it.

Run your measurement method against artificial systems that create hypothetical pause scenarios. See if your reported results agree with how you would describe that system's behavior.

Don’t waste time analyzing until you establish sanity

Don’t ever use or derive from standard deviation

Always measure Max time. Consider what it means...

Be suspicious.

Measure %’iles. Lots of them.
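As one way to act on these suggestions, the sketch below reports many percentiles plus the max for a data set whose behavior is known in advance. The data (9,990 operations at 1 ms and 10 stalls at 1,000 ms), the nearest-rank percentile method, and all names are illustrative assumptions.

```python
import math
from fractions import Fraction
from statistics import mean

def percentile(samples, pct):
    """Nearest-rank percentile; pct may be fractional, such as 99.99."""
    ordered = sorted(samples)
    # Fraction(str(pct)) keeps values like 99.9 exact when computing the rank.
    rank = math.ceil(Fraction(str(pct)) * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

def summarize(samples, pcts=(50, 90, 99, 99.9, 99.99)):
    """Report several percentiles, the max, and (for contrast) the mean."""
    report = {f"p{p}": percentile(samples, p) for p in pcts}
    report["max"] = max(samples)
    report["mean"] = mean(samples)
    return report

# Hypothetical system: almost always 1 ms, with ten 1,000 ms stalls.
latencies_ms = [1] * 9990 + [1000] * 10
report = summarize(latencies_ms)
for name, value in report.items():
    print(f"{name:>7}: {value}")
```

In this made-up data set the stalls are invisible at every percentile up to 99.9 and the mean hints at nothing worse than a couple of milliseconds; only the higher percentiles and the max describe the system the way a person who knew about the stalls would.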


HdrHistogram


HdrHistogram

If you want to be able to produce charts like this...

Then you need both good dynamic range and good resolution
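To illustrate how a histogram can offer wide dynamic range and bounded relative error at the same time, here is a simplified bucketing sketch. It is not HdrHistogram's actual layout or API; the function names, the 7-bit sub-bucket width, and the sample values are assumptions chosen for illustration.

```python
from collections import Counter

SUB_BUCKET_BITS = 7  # assumed precision: relative error stays below 1/64

def bucket_of(value):
    """Map a non-negative integer to (shift, sub_bucket).

    Small values get their own bucket; larger values keep only their
    top SUB_BUCKET_BITS bits, so bucket width grows with magnitude.
    """
    if value < (1 << SUB_BUCKET_BITS):
        return (0, value)
    shift = value.bit_length() - SUB_BUCKET_BITS
    return (shift, value >> shift)

def bucket_floor(bucket):
    """Lowest value that maps to the given bucket."""
    shift, sub = bucket
    return sub << shift

# Hypothetical latencies spanning 1 usec to 1 hour, in microseconds.
values = (1, 1000, 1001, 1_000_000, 3_600_000_000)
counts = Counter(bucket_of(v) for v in values)
for v in values:
    print(v, "->", bucket_of(v), "floor", bucket_floor(bucket_of(v)))
```

A fixed-width linear histogram would need billions of counters to cover this range at microsecond resolution; scaling bucket width with magnitude keeps the counter count small while the relative error stays bounded.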


Shape of Constant latency

10K fixed line latency


Shape of Gaussian latency

10K fixed line latency with added Gaussian noise (std dev. = 5K)


Shape of Random latency

10K fixed line latency with added Gaussian (std dev. = 5K) vs. random (+5K)


Shape of Stalling latency

10K fixed base, stall magnitude of 50K, stall likelihood = 0.00005 (interval = 100)


Shape of Queuing latency

10K fixed base, occasional bursts of 500 msgs, handling time = 100, burst likelihood = 0.00005


Shape of Multi Modal latency

10K mode0, 70K mode1 (likelihood 0.01), 180K mode2 (likelihood 0.00001)
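A synthetic source with these three modes is easy to build and is the kind of artificial system a measurement method can be tested against. The sketch below uses the mode values and likelihoods from the slide; the function name, the seed, and the sample count are assumptions, and the slide does not state the time unit.

```python
import random

def multimodal_latency(rng, base=10_000, mode1=70_000, p_mode1=0.01,
                       mode2=180_000, p_mode2=0.00001):
    """Return one latency drawn from the three modes listed on the slide."""
    roll = rng.random()
    if roll < p_mode2:
        return mode2          # rare "terrible" mode
    if roll < p_mode2 + p_mode1:
        return mode1          # occasional "somewhat bad" mode
    return base               # common "good" mode

rng = random.Random(42)  # fixed seed so runs are repeatable
samples = [multimodal_latency(rng) for _ in range(100_000)]

in_mode1 = samples.count(70_000)
print("share of samples in mode1:", in_mode1 / len(samples))
```

Feeding such samples to a reporting tool shows whether it reveals the separate modes or smears them into an average.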


And this is what the real(?) world sometimes looks like…


Real world “deductive reasoning”


http://www.jhiccup.org


jHiccup


Discontinuity in Java execution
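One way to observe execution discontinuity is to have an otherwise idle loop sleep for a short, fixed interval and record how much longer than requested each sleep took. The sketch below is written in that spirit and is an assumption-laden illustration, not jHiccup's implementation; the function name and the injectable `clock` and `sleep` parameters are hypothetical.

```python
import time

def record_hiccups(iterations, resolution_s=0.001,
                   clock=time.perf_counter, sleep=time.sleep):
    """Sleep repeatedly and record how far each wake-up overshoots.

    Any overshoot is time during which this thread could not run, whatever
    the cause (GC pause, scheduler delay, swapping, and so on).
    """
    hiccups = []
    for _ in range(iterations):
        before = clock()
        sleep(resolution_s)
        elapsed = clock() - before
        hiccups.append(max(0.0, elapsed - resolution_s))
    return hiccups

if __name__ == "__main__":
    observed = record_hiccups(200)
    print(f"max hiccup: {max(observed) * 1000:.3f} msec")
```

Because the loop does no application work, whatever it records is a property of the platform rather than of the application code. The recorded values can then be reported as percentiles and a max like any other latency data.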


Examples


Oracle HotSpot ParallelGC Oracle HotSpot G1

1GB live set in 8GB heap, same app, same HotSpot, different GC


Oracle HotSpot CMS Zing Pauseless GC

1GB live set in 8GB heap, same app, different JVM/GC


Oracle HotSpot CMS Zing Pauseless GC

1GB live set in 8GB heap, same app, different JVM/GC- drawn to scale


Zing

Low latency trading application

Oracle HotSpot (pure NewGen) Zing Pauseless GC



Low latency trading application

Oracle HotSpot (pure NewGen) Zing Pauseless GC



Low latency trading application – drawn to scale

Oracle HotSpot (pure NewGen) Zing Pauseless GC


Key Latency Utils and Zing

jhiccup.com

hdrhistogram.com

azul.com/zing


Compulsory Marketing Pitch


Azul Hot Topics


Zing 16.01 available: 2 TB Xmx, Docker, JMX, Performance

Zing for Cloud: Amazon AMIs, Mesos in R&D, Cloud Foundry on deck

Zing for Big Data: Cassandra offering, Hazelcast cert, Akka/Spark in Zing open source program

Zulu: Azure and AWS, JSE Embedded, ARM32 evals, Intel Edison evals


Q&A and In Closing…

Go get some Zing today!

At the very least, download jHiccup.

Which is the better dive bar: The Office or Purple Orchid?

azul.com


@schuetzematt