understanding application hiccups - and what you can do about them
DESCRIPTION
In this presentation, Gil Tene (CTO, Azul Systems) will introduce simple, non-obstrusive methods for measuring and characterizing platform "hiccups" during application execution. Using the new jHiccup open source tool, Gil will demonstrate and chart commonly observed behaviors of idle, mostly idle, and busy systems, as well as common workload types that experience outliers due to garbage collection pauses and other runtime-induced delays. After demonstrating how simple, non-obtrusive measurement can establish a clear "best case" baseline for any expected application responsiveness, Gil will discuss the important things to look for in such measurements, as well as the common pitfalls experienced early in characterization attempts.TRANSCRIPT
©2011 Azul Systems, Inc.
Understanding Application Hiccups
and what you can do about them
An introduction to the Open Source jHiccup tool
Gil Tene, CTO & co-Founder, Azul Systems
©2011 Azul Systems, Inc.
About me: Gil Tene
co-founder, CTO @Azul Systems
Have been working on “think different” GC approaches since 2002
Created Pauseless & C4 core GC algorithms (Tene, Wolf)
A Long history building Virtual & Physical Machines, Operating Systems, Enterprise apps, etc... * working on real-world trash compaction issues, circa 2004
©2011 Azul Systems, Inc.
About AzulWe make scalable Virtual Machines
Have built “whatever it takes to get job done” since 2002
3 generations of custom SMP Multi-core HW (Vega)
Now Pure software for commodity x86 (Zing)
“Industry firsts” in Garbage collection, elastic memory, Java virtualization, memory scale
Vega
C4
©2011 Azul Systems, Inc.
A classic look at response time behavior
Key Assumption: Response time is a function of load
source: IBM CICS server docuementation, “understanding response times”
Average?Max?
Median?90%?99.9%
©2011 Azul Systems, Inc.
Common fallacies
Computers run application code continuouslyCPUs stop processing application code for all sorts of reasons
e.g: Interrupts. Scheduling of other work, swapping, etc.
Modern system architectures add more: Power management, Virtualization (cross-image context switching, physical VM motion), Garbage Collection, etc.
Response time can be measured as work units/time.
Response time exhibits a normal distributionLeading to attempts to represent with average + std. deviation
©2011 Azul Systems, Inc.
Response time over time
When we measure behavior over time, we often see:
source: ZOHO QEngine White Paper: performance testing report analysis
9
Does the response time meet my target requirements? Response Time Graphs Response time is one of the most important characteristics of load testing. Response time reports and graph measures the web user experience as it indicates how long the user waits for the server to respond for his request. This is the time taken, in seconds, to receive full response from the server. It is equivalent to the time taken by the client to connect to the server and receive the response including image, script and stylesheet. Response Time Graph (Overall) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
Interpreting Results From this graph it is possible to identify peaks in the response time over a period of time. Helps to identify issues when server is continuously loaded for long duration of time. Response Time Graph (per Transaction) Response Time versus Elapsed Time report indicates the average response time of the transactions over the elapsed time of the load test as shown.
“Hiccups”
©2011 Azul Systems, Inc.
Application Hiccups
Where do they come from?Usually a factor of outside of the individual transaction work
E.g: Queueing, accumulated work, platform inconsistency
Do they matter?That depends. What are your end-user’s expectations?
Hiccups often dominate response time behavior
How can/should we measure them?Average? Max? 99.9%? Mean with Std. deviation?
Hiccup magnitude is often not a function of load
©2011 Azul Systems, Inc.
0"
1000"
2000"
3000"
4000"
5000"
6000"
7000"
8000"
9000"
0" 20" 40" 60" 80" 100" 120" 140" 160" 180"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
What happened here?
Source: Gil running an idle program and suspending it five times in the middle
“Hiccups”
©2011 Azul Systems, Inc.
Pitfall: Calulating %’iles “naïvely”
Common Example:build/buy simple load tester to measure throughput
issue requests one by one at a certain rate
measure and log response time for each request
results log used to produce histograms, percentiles, etc.
So what’s wrong with that?works well only when all responses fit within in rate interval
technique includes “automatic backoff” and coordination
But requirements interested in random, uncoordinated requests
Bad how bad can this get, really?
Example of naïve %’ile
1 msec
System Stalledfor 100 Sec
Elapsed Time
System easily handles 100 requests/sec
Responds to eachin 1msec
How would you characterize this system?Naïve results:
10,000 @ 1msec 1 @ 100 second
Naïve characterization: 99.99% below 1 sec !!!
Proper measurement
System Stalledfor 100 Sec
Elapsed Time
System easily handles 100 requests/sec
Responds to eachin 1msec
10,000 resultsVarying linearlyfrom 100 secto 10 msec
10,000 results@ 1 msec each
Proper characterization: 50% below 1 second
©2011 Azul Systems, Inc.
jHiccup
A tool for capturing and displaying platform hiccupsRecords any observed non-continuity of the underlying platform
plots results in simple, consistent format
Simple, non-intrusiveAs simple as adding the word “jHiccup” to your java launch line
% jHiccup java myflags myApp
Adds a background thread that samples time @ 1000/sec
Open Sourcereleased to the public domain, creative commons CC0
©2011 Azul Systems, Inc.
what jHiccup measuresjHiccup measures the platform, not the application
Experiences whatever delays application threads would see
Highlights inconsistency in platform execution of “nothing”
No attempt to identify cause (e.g. scheduling, GC, swapping, etc.)
An application cannot behave better than it’s platformIf the platform stalls, the application stalled as well.
jHiccup provides a lower bound for “best application behavior”
Useful control measurementsComparing jHiccup results for an idle application running on the same platform, at the same time, narrows down behavior to process-specific artifacts (-c option)
©2011 Azul Systems, Inc.
The anatomy of Hiccup Charts
©2011 Azul Systems, Inc.
Telco App Example
0"
20"
40"
60"
80"
100"
120"
140"
0" 500" 1000" 1500" 2000" 2500"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"0"
20"
40"
60"
80"
100"
120"
140"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Hiccups"by"Percen?le" SLA"
Optional SLA plotting
Max Time per interval
Hiccup duration at percentile
levels
©2011 Azul Systems, Inc.
0"
200"
400"
600"
800"
1000"
1200"
0" 200" 400" 600" 800" 1000" 1200" 1400"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=967.68&
0"
200"
400"
600"
800"
1000"
1200"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2011 Azul Systems, Inc.
How jHiccup worksBackground thread, measures time to do “nothing”
Sleeps for 1 milliseconds, allocates 1 object, records time gap
Collects detailed “floating point” histogram of observed times
Constant, small memory footprint, virtually no cpu load
No effect on application threads
Outputs two filesA time-based log with 1 line per interval (default 5 sec interval).
A percentile distribution log of accumulated histogram data
Spreadsheet imports files and plots dataA standard, simple chart format
Some fun with jHiccup
©2011 Azul Systems, Inc.
Examples
Idle App on Busy System
0"
10"
20"
30"
40"
50"
60"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=49.728&
0"
10"
20"
30"
40"
50"
60"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Idle App on Quiet System
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.336&
0"
5"
10"
15"
20"
25"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Idle App on Quiet System
0"
5"
10"
15"
20"
25"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=22.336&
0"
5"
10"
15"
20"
25"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Idle App on Dedicated System
0"
0.05"
0.1"
0.15"
0.2"
0.25"
0.3"
0.35"
0.4"
0.45"
0" 100" 200" 300" 400" 500" 600" 700" 800" 900"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=0.411&
0"
0.05"
0.1"
0.15"
0.2"
0.25"
0.3"
0.35"
0.4"
0.45"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
A 1GB Java Cache under load
0"
500"
1000"
1500"
2000"
2500"
3000"
3500"
4000"
0" 500" 1000" 1500" 2000"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
Max=3448.832&
0"
500"
1000"
1500"
2000"
2500"
3000"
3500"
4000"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Telco App
0"
200"
400"
600"
800"
1000"
1200"
1400"
1600"
1800"
0" 200" 400" 600" 800" 1000" 1200" 1400" 1600" 1800"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=1665.024&
0"
200"
400"
600"
800"
1000"
1200"
1400"
1600"
1800"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Fun with jHiccup
©2011 Azul Systems, Inc.
Obviously, we can use it forcomparison purposes
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Zing 5, 1GB in an 8GB heap
0"
5"
10"
15"
20"
25"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"
Max=20.384&
0"
5"
10"
15"
20"
25"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Zing 5, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%" 99.9999%"Max=20.384&0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Oracle HotSpot CMS, 4GB in a 18GB heap
0"
5000"
10000"
15000"
20000"
25000"
30000"
35000"
40000"
45000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=40173.568&
0"
5000"
10000"
15000"
20000"
25000"
30000"
35000"
40000"
45000"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
Oracle HotSpot CMS, 1GB in an 8GB heap
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
0" 500" 1000" 1500" 2000" 2500" 3000" 3500"
Hiccup
&Dura*
on&(m
sec)&
&Elapsed&Time&(sec)&
Hiccups&by&Time&Interval&
Max"per"Interval" 99%" 99.90%" 99.99%" Max"
0%" 90%" 99%" 99.9%" 99.99%" 99.999%"
Max=13156.352&
0"
2000"
4000"
6000"
8000"
10000"
12000"
14000"
Hiccup
&Dura*
on&(m
sec)&
&&
Percen*le&
Hiccups&by&Percen*le&Distribu*on&
©2011 Azul Systems, Inc.
Q & Ahttp://www.azulsystems.com
http://www.azulsystems.com/dev_resources/jhiccup