Database Stalls, From the Ordinary to the Obscure
Preetam Jinka (@PreetamJinka), Software Engineer
Percona Live 2017
VividCortex’s database monitoring application is the best way to improve your database performance, efficiency, and uptime. Supporting MySQL, PostgreSQL, Redis, MongoDB, and Amazon Aurora, VividCortex uses patented algorithms to reveal key insights, helping users fix performance problems before they impact customers. Say hello and see a demo, Booth #205.
We’re hiring!
This talk isn’t about the math.
Come to the O’Reilly booth after the talk to pick up a free copy of our book!
What is a stall?
Stalls
● Short periods when work isn’t being done
● We’re detecting stalls as short as 1 second
● We do this with zero configuration and no fixed thresholds
○ The secret sauce: we have a model.
We’re trying to catch small problems before they turn into bigger ones.
Little’s Law
● L = λ × W
● Concurrency = Throughput × Latency
● Little’s Law provides a model to relate throughput and concurrency
In MySQL:
● Concurrency: threads_running
○ There’s one thread per query.
○ From SHOW STATUS
● Throughput: queries completed per second
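As a rough illustration of Little’s Law, the sketch below computes concurrency from throughput and latency, and rearranges the same relation to get the average latency implied by a concurrency and throughput reading. The function names and the sample numbers are hypothetical and are not taken from the talk.

```python
def concurrency(throughput_qps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W (queries in progress on average)."""
    return throughput_qps * latency_s


def implied_latency(threads_running: float, throughput_qps: float) -> float:
    """Little's Law rearranged: W = L / lambda (average seconds per query)."""
    if throughput_qps <= 0:
        raise ValueError("throughput must be positive")
    return threads_running / throughput_qps


if __name__ == "__main__":
    # Hypothetical server: 2000 queries/s, each taking 4 ms on average.
    print(concurrency(2000, 0.004))
    # Hypothetical readings: 8 running threads while completing 2000 queries/s.
    print(implied_latency(8, 2000))
```

Read this way, a threads_running value that climbs while completed queries per second falls means the implied latency per query is growing.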
MySQL Server Stall Example
More queries in progress
Fewer being completed
MySQL Server Stall Example
All of the stalled queries are completing after the fault ends.
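The pattern in this example (more queries in progress, fewer being completed) can be approximated with a very small sketch. This is not VividCortex’s model: it is a simplified stand-in that uses Little’s Law to derive an implied latency per one-second sample and flags samples far above the median. The `factor` parameter, the function names, and the sample data are all assumptions made for illustration.

```python
from statistics import median


def implied_latency(threads_running: float, completed_per_sec: float) -> float:
    """Little's Law rearranged: W = L / lambda."""
    if completed_per_sec == 0:
        # Nothing completed during this second: treat latency as unbounded.
        return float("inf")
    return threads_running / completed_per_sec


def find_stalls(samples, factor=10.0):
    """Return indices of samples whose implied latency exceeds factor x median.

    samples: list of (threads_running, queries_completed_per_second) tuples,
    one per second. `factor` is an illustrative knob, not part of the talk.
    """
    latencies = [implied_latency(l, x) for l, x in samples]
    baseline = median(latencies)
    return [i for i, w in enumerate(latencies) if w > factor * baseline]


if __name__ == "__main__":
    # Hypothetical one-second samples: a two-second stall, then a burst of
    # completions once the fault ends.
    samples = [(8, 2000), (9, 2100), (8, 1900), (60, 300),
               (75, 150), (10, 2600), (8, 2000)]
    print(find_stalls(samples))
```

Comparing against the series’ own median rather than an absolute number keeps the sketch free of a fixed latency threshold, although the multiplier itself is still a tuning choice.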
Where do stalls come from?
● Running out of credits on EBS volumes
● MySQL query cache
● Lock contention
● A bad network cable!
● Transparent huge pages (THP)
○ “If a transparent huge page isn’t available, the application will stall to let memory compaction run to free a page.”
But we don’t really care about any of those things.
We’re focused on the work your database is doing.
Work-centric monitoring
Work-centric monitoring in one slide
● Focus on the work your systems are doing
● Find relationships between metrics (maybe using a model)
● Monitor what you want to optimize
● Focus on heavy hitters
● Automatically detect changes
How to respond to database stalls
Slowness is about spending time on something.
Things spend time doing work or waiting.
Work
● CPU
● Disk I/O
● Various storage engine metrics
● Slow queries
○ Large scans
Waiting
● Lock contention
● Disk I/O
● Memory compaction
Walkthrough
Be careful about causality.
Thread states
Back pressure
Back pressure is about systems receiving more work than they can process.
It’s much better to handle back pressure higher up the stack.
Clients
APIs
Database
System
Low-level back pressure can cause unfair slowdowns higher up the stack.*
*Totally untested hypothesis. :)
50 ms shift
● ~1 sec queries stay ~1 sec queries (1x)
● ~1 ms queries become ~50 ms queries (50x)
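The arithmetic behind the 50 ms shift can be checked with a short sketch. It assumes the delay is simply added to each query’s usual latency; the function name is hypothetical.

```python
def slowdown(base_latency_ms: float, added_delay_ms: float = 50.0) -> float:
    """Ratio of the delayed latency to the original latency."""
    return (base_latency_ms + added_delay_ms) / base_latency_ms


if __name__ == "__main__":
    # A ~1 sec query barely notices the added 50 ms.
    print(slowdown(1000.0))
    # A ~1 ms query becomes roughly fifty times slower.
    print(slowdown(1.0))
```

The same fixed delay is a rounding error for slow queries and a large multiple for fast ones, which is the sense in which the slowdown is unfair.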
Ways to deal with back pressure
● Rate limiting / throttling
● Use a queue to contain requests at a higher level
● Somehow prioritize some requests over others
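One way to contain requests at a higher level is a bounded queue that rejects new work when it is full, so that excess load is turned away before it reaches the database. The sketch below is a minimal illustration using Python’s standard `queue` module; the `BoundedIntake` class, its method names, and the capacity are hypothetical and not something the talk prescribes.

```python
import queue


class BoundedIntake:
    """Holds at most max_pending requests and rejects the rest."""

    def __init__(self, max_pending: int):
        self._pending = queue.Queue(maxsize=max_pending)
        self.rejected = 0

    def submit(self, request) -> bool:
        """Accept the request if there is room; otherwise reject it right away."""
        try:
            self._pending.put_nowait(request)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def next_request(self):
        """Return the oldest pending request, or None if the queue is empty."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


if __name__ == "__main__":
    intake = BoundedIntake(max_pending=2)
    for name in ["a", "b", "c"]:
        print(name, intake.submit(name))
    print("rejected:", intake.rejected)
```

Rejecting at the edge lets the caller retry or degrade gracefully, instead of letting every request pile up inside the database.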
Can you eliminate stalls?
Probably not all.
Most? Perhaps!