MongoDB Performance { in theory and practice }

Baron Schwartz • June 2017

Introduction

{
  email: “[email protected]”,
  tweet: “@xaprb”,
  slides: true
}

What You’ll Learn From This Talk

• A clear definition of performance
• How to measure and analyze performance with profiles
• Types of performance problems and their solutions
• The MongoDB performance instrumentation

What is Performance?

There are two perspectives on performance.

• Users care about request performance.
• Service owners care about serving the load with minimal resources.

The Zen of Performance

The Zen of Performance says you can’t understand user or server performance in isolation.

• User behavior influences the servers
• Users affect other users
• The server and resource behavior affects users

The system is more than the service/servers. Users are part of it.

Learning By Testing Assumptions

• From a partial view of the system, you can often calculate the rest.
• When you measure it instead, you find reconciliation errors.
• You can learn a lot from this.

The User’s Perspective

What Do Users Want?

• Users care about request performance.
  • “I want my answer, and I want it fast.”

• How to measure it: latency
  • Also called response time and residence time
  • Users care about each individual request’s latency.
  • They care if latency is consistent.

The User’s View Of A Request

[Diagram: a request goes in and a response comes back. The time between them is the residence time (latency).]

What to measure?
• Latency

Users Care About Their Own Performance

• Performance, from the user’s point of view, is singular.
  • Users make one request at a time.

• Users don’t care about other users.
  • They don’t know about other users!
  • They don’t care about other users’ performance!

• They don’t care about the server!
  • To the user, the server/service is a black box that’s supposed to just work.

The Service Owner’s Perspective

What Does the Service Owner Want?

• Understanding server performance is harder than understanding request performance.
• We can’t think about performance of requests in isolation.
• We have to consider a system of requests and resources, and how they interact.
• Requests influence each other through contention for shared resources.
• The resulting behavior is complex, not simple.

The Secret Life Of A Request

[Diagram: request → queue → server → response. Queue time plus service time equals residence time (latency).]

What to measure?
• Latency
• Queue time
• Service time

The Secret Life Of Many Requests

[Diagram: many requests passing through the queue and the server.]

How To Think About Performance

What is Performance? (Cont’d)

• Performance is the interaction of two parts of an intricate system.

• User’s perspective: request-focused.
  • Requests are the unit of work.
  • Definition: request latency, with units of seconds/request.

• Service owner’s perspective: resource-focused.
  • Resources are what does the work.
  • Definition: system throughput, with units of requests/second.

• Performance is both, simultaneously.
  • Notice that the definitions are inverses.
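As a small illustration of the two definitions, the sketch below computes both from the same set of completed requests. The record fields (`start`, `end`, in seconds) and the sample values are hypothetical, not taken from the talk.

```javascript
// Hypothetical completed requests; start/end are seconds since some epoch.
const requests = [
  { start: 0.0, end: 0.2 },
  { start: 0.5, end: 0.6 },
  { start: 1.0, end: 1.5 },
  { start: 1.5, end: 2.0 },
];

// User's view: mean latency, in seconds per request.
function meanLatency(reqs) {
  const total = reqs.reduce((sum, r) => sum + (r.end - r.start), 0);
  return total / reqs.length;
}

// Service owner's view: throughput, in requests per second,
// over the observation window covered by the log.
function throughput(reqs) {
  const first = Math.min(...reqs.map((r) => r.start));
  const last = Math.max(...reqs.map((r) => r.end));
  return reqs.length / (last - first);
}
```

The units are inverses of each other, but the two values need not be reciprocals: requests can overlap, and the server can sit idle between them.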

Many Things Aren’t Really Performance

• Performance is NOT what you’ve probably been told!
  • It isn’t CPU utilization. CPU is a resource, meant to be utilized!
  • It isn’t cache hit ratio.
  • It isn’t lock time ratio.
  • It isn’t load average.
  • It isn’t the number of write tickets in use or yields or connections or...
  • It isn’t anything else, other than request latency or throughput.

A Clear Performance Definition Solves Problem #1

• The first challenge in solving performance problems is clearly identifying the problem.

• By understanding performance in terms of latency and throughput, you are already winning.

• Cache ratios, utilization, etc. are all causes or effects of performance/behavior.
  • They might be symptoms of a performance problem.
  • But the symptom isn’t the problem itself.
  • Treating it as such causes false correlations and other mistakes.

What’s a Performance Problem?

A performance problem is when...

a) A request has high latency
b) A service/server uses too many resources to produce throughput

That is literally it. Everything is one of these two.

How Can You Measure Performance?

• You cannot improve what you cannot measure.
• As engineers, we must measure all requests, so every user has a good experience.
• The challenge is that we cannot inspect every request individually, so we need to examine them in aggregates.
• We can find poor performance by looking at these aggregates in the right ways.

Aggregating Requests

[Diagram: requests pass through the queue and the server; each completed request is written to a log.]

At request completion, record every fact we know about the request: timestamp T, count 1, latency N, queue time N, service time N, and other information (user, type, host...).
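A minimal sketch of this kind of recording, assuming a hypothetical `serveAndRecord` wrapper and an in-memory array standing in for the log. The field names follow the list above; everything else is illustrative.

```javascript
// In-memory stand-in for a request log (hypothetical; a real system
// would write to a file, a collection, or a metrics pipeline).
const requestLog = [];

// Run `work` and, when it completes, log everything known about the request.
// `enqueuedAt` is when the request arrived; `clock` returns milliseconds.
function serveAndRecord(request, enqueuedAt, work, clock = Date.now) {
  const startedAt = clock();          // request leaves the queue
  const result = work(request);
  const finishedAt = clock();         // request completes
  requestLog.push({
    timestamp: finishedAt,
    count: 1,
    queueTime: startedAt - enqueuedAt,
    serviceTime: finishedAt - startedAt,
    latency: finishedAt - enqueuedAt, // queue time + service time
    user: request.user,
    type: request.type,
  });
  return result;
}
```

Keeping `count: 1` on every record looks redundant, but it lets the same SUM() aggregation produce both total time and total count later.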

How To Analyze Performance With Profiles

Use Profiles To Analyze Requests

• Profiling is the most important performance analysis tool.
• Profiling enables slice-and-dice drilldown into system behavior.
• Profiling is ideally interactive and multi-dimensional.
• As we’ll see later, I advocate profiling the server by queries.
• Then, after identifying a query, profiling that query to figure out where it spends its time.

A Profile Subdivides And Sorts The Whole

Name          Rank   Aggregate Time   Aggregate Count
Sub-Item 1    1      932.1            401
Sub-Item 2    2      67.8             892
Sub-Item 3    3      32.0             11
<The Rest>           612.9            1837

The sub-items in a profile can be queries, or stages of execution within a query.
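The shape of such a profile can be sketched in a few lines. The function below is an illustration, not a tool from the talk: `buildProfile` and its parameters are hypothetical, and each record is assumed to carry a numeric `latency`.

```javascript
// Build a profile: group records by a key, sum time and count,
// rank by total time, and fold everything past `topN` into "<The Rest>".
function buildProfile(records, keyOf, topN) {
  const groups = new Map();
  for (const r of records) {
    const key = keyOf(r);
    const g = groups.get(key) || { name: key, time: 0, count: 0 };
    g.time += r.latency;
    g.count += 1;
    groups.set(key, g);
  }
  const ranked = [...groups.values()].sort((a, b) => b.time - a.time);
  const top = ranked.slice(0, topN).map((g, i) => ({ rank: i + 1, ...g }));
  const rest = ranked.slice(topN).reduce(
    (acc, g) => ({ ...acc, time: acc.time + g.time, count: acc.count + g.count }),
    { name: "<The Rest>", time: 0, count: 0 }
  );
  return rest.count > 0 ? [...top, rest] : top;
}
```

Changing `keyOf` pivots the same data by a different dimension, for example by user or by host instead of by query.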

How To Detect And Diagnose Performance Problems

Common Causes of Performance Problems

Performance problems happen because:

1. A request executes slowly
2. A system receives too many requests
3. An app/user makes useless requests

Each of these is easy to see if you’re measuring requests (queries).

Performance Troubleshooting Flowchart

1. Profile by SUM(latency), sort by SUM(count). Time-consuming, frequent?
  • Yes: check for simple speedups (e.g. indexes).
  • No: go to step 2.
2. Profile by SUM(count), sort by SUM(latency). Frequent, slow?
  • Yes: check for needless queries (e.g. N+1).
  • No: go to step 3.
3. Profile by SUM(count), sort by SUM(count). Frequent, useless?
  • Yes: check for “driver junk” (e.g. ping).
  • No: done.

Problem #1: Request Too Slow

• Requests that are too slow are spending too much time.
  • The solution is to measure where they spend their time.
  • It could be working (service time). What are the stages and timings?
  • Or it could be waiting (queue time).
  • You need to sub-profile the query to find out!

• This sounds too simple and obvious, but it’s rarely done.
  • Instead, people often jump to Google and “tuning the config...”

• What if you know a request is slow but don’t have deep profiling?
  • You’ll have to use profiling tools that show a deeper level of visibility.
  • You can see stage-by-stage timings in the MongoDB Profiler or explain()
    • e.g. COLLSCAN, IXSCAN, FETCH

Solution #1: Profile By Sum of Latency

• Profile requests to find the most time-consuming ones.
  • Rank by sum of latency.
  • This finds very fast but frequent ones, not just slow ones!

• Then sort by average latency.
  • This finds the slowest time-consuming queries
  • These are often the easiest to fix! (it’s harder to optimize fast ones)

• You’ll usually find queries with obvious problems:
  • queries that are missing indexes
  • queries that are spending time locked
  • queries that are doing complex computation
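The two-step ranking can be sketched as follows. The rows, their field names (`sumLatency`, `count`) and the function name are hypothetical; times are in milliseconds.

```javascript
// Hypothetical per-query aggregates (times in milliseconds).
const rows = [
  { query: "find users by email", sumLatency: 9000, count: 90000 },
  { query: "report by region",    sumLatency: 8000, count: 40 },
  { query: "insert event",        sumLatency: 500,  count: 5000 },
  { query: "rare admin lookup",   sumLatency: 300,  count: 1 },
];

// Step 1: keep the `n` most time-consuming queries (rank by sum of latency).
// Step 2: within those, sort by average latency to surface the slow ones.
function slowestOfTheCostliest(profileRows, n) {
  return [...profileRows]
    .sort((a, b) => b.sumLatency - a.sumLatency)
    .slice(0, n)
    .map((r) => ({ ...r, avgLatency: r.sumLatency / r.count }))
    .sort((a, b) => b.avgLatency - a.avgLatency);
}
```

With `n = 2`, the fast-but-frequent lookup makes the cut because of its total time, while the single slow admin query does not: its average is the highest in the sample data, but it contributes little to the whole.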

Problem #2: Too Many Requests

• A frequent anti-pattern is shipping lots of data over the network.
  • The N+1 pattern is an example:
    • Do 1 query to find something out
    • Do N queries to fetch N documents matching the 1
    • Do the calculations in the app
  • It’s better to do an aggregation/subquery/join/etc in the DB instead!

• Another example is cacheable queries (repetitive, redundant)
• Another is check, recheck, recheck...
• “Running queries in the app” is usually better left to the DB.
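To make the cost of the N+1 pattern concrete, here is a toy comparison. The fake collection is purely hypothetical and only counts round trips; the batched variant is comparable in spirit to a single `$in` query, and a server-side aggregation would go one step further by doing the sum in the database too.

```javascript
// Tiny in-memory stand-in for a collection that counts round trips.
// Hypothetical: it only mimics the idea of one network call per query.
function makeFakeCollection(docs) {
  let roundTrips = 0;
  return {
    findOne: (id) => { roundTrips++; return docs.find((d) => d._id === id); },
    findIn: (ids) => { roundTrips++; return docs.filter((d) => ids.includes(d._id)); },
    roundTrips: () => roundTrips,
  };
}

// N+1 style: one query per id.
function totalOneByOne(coll, ids) {
  return ids.reduce((sum, id) => sum + coll.findOne(id).amount, 0);
}

// Batched: one query for all ids.
function totalBatched(coll, ids) {
  return coll.findIn(ids).reduce((sum, d) => sum + d.amount, 0);
}
```

Both functions return the same total, but the first issues one request per document while the second issues one request in all.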

Solution #2: Profile by Sum of Count

• This will rank the most frequent queries to the top.
• As a second step, sort within these, by sum of latency.
• This will rank the slowest ones (the worst frequent queries) first.

• The problem and solution are usually app-dependent and obvious.

Problem #3: Useless Requests

• Lots of useless requests happen in database drivers and ORMs
• Examples:
  • Query, but no cursor fetch (see also “prepare but no execute” in RDBMS)
  • Ping before every query (look-before-you-leap pattern)
  • Update data that might not exist, “just in case”
  • Reset connection settings constantly, or set to defaults

• These are obvious if you sort and rank by SUM(count).

Learning The MongoDB Performance Instrumentation

MongoDB’s Performance Instrumentation

• The MongoDB slow query log
• The MongoDB system profiler
• The MongoDB top() command
• mongotop

• The MongoDB serverStatus()
• The MongoDB diagnostic.data

• The MongoDB db.currentOp()
• The MongoDB explain() command
• Externally captured performance measurements

MongoDB’s Slow Query Log

• MongoDB logs slow queries to the server log.
• “Slow” is defined by operationProfiling.slowOpThresholdMs
• I don’t like slow query log analysis. It ends badly.

Benefits
• It’s built-in and already available.
• Log analysis plays well with lots of existing tools.

Drawbacks
• Doesn’t capture fast-but-frequent queries that add a lot of load.
• Adds overhead to the server and latency to queries.
• Log analysis quickly becomes a manual, labor-intensive process.
• Doesn’t have much detail.
• Only 1-millisecond resolution.

MongoDB’s System Profiler

• The system.profile is a per-DB capped collection of slow queries.
• Uses operationProfiling.slowOpThresholdMs by default.
• https://docs.mongodb.com/manual/reference/database-profiler/

Benefits
• It’s built-in.
• It’s internal and has richer execution data about queries (per-stage timings).

Drawbacks
• Doesn’t capture fast-but-frequent queries that add a lot of load.
• Adds overhead to the server.
• The global threshold affects the log too.
• Not easy to analyze in aggregate.

Enabling MongoDB’s Profiler

• The Profiler is off by default, so you have to enable it.
  • db.setProfilingLevel(level, slowms)
  • level: 0=off, 1=slow, 2=all
  • 2nd param optional; overrides operationProfiling.slowOpThresholdMs

• When enabled, every command that matches the profiling spec is written to the system.profile capped collection.
  • Whether to leave it enabled all the time is up for debate.
  • Enabling the Profiler turns reads into writes.
  • Not enabling it allows you to develop performance problems you have no capacity to diagnose.
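A minimal sketch of turning the profiler on and off from a script. The wrapper functions are hypothetical; they only call `db.setProfilingLevel()` and `db.getProfilingStatus()` on a shell database handle passed in as `db`, and the threshold you choose is up to you.

```javascript
// `db` is a mongo shell database handle, passed in so the functions
// can be used from a script. Any threshold value is only an example.
function enableSlowOpProfiling(db, slowMs) {
  db.setProfilingLevel(1, slowMs);   // level 1 = profile slow operations only
  return db.getProfilingStatus();    // report the settings now in effect
}

function disableProfiling(db) {
  db.setProfilingLevel(0);           // level 0 = profiler off
  return db.getProfilingStatus();
}
```

In the shell this might be called as `enableSlowOpProfiling(db, 50)` for a short investigation, followed by `disableProfiling(db)` when done.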

Using the MongoDB Profiler

• Each profiled query becomes a document in system.profile
  • The profiled query is the ‘query’ or ‘command’ field
  • The rest is stats/information about the operation’s execution

• For profiling purposes, aggregation is important
  • Ideally query shape, but that’s not easy
  • As a compromise, {ns, op} works OK
  • More on this later

• Key things to examine:
  • We’re trying to figure out where time is spent, so timings matter most
  • Second-order priority is throughput (counts of things)

Important Parts of Profiler Documents

• Timing-Related Fields:
  • millis
  • execStats.stage and executionTimeMillisEstimate
    • See also child inputStage for sub-profiling
  • locks.timeAcquiringMicros

• Throughput Counts:
  • keysExamined
  • docsExamined
  • hasSortStage
  • nModified
  • numYield
  • locks.{various counts}
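Sub-profiling a single operation means walking the stage tree. The sketch below flattens an `execStats` document into one row per stage. The sample document is hypothetical and heavily trimmed, and `stageTimings` is an illustrative helper, not part of MongoDB.

```javascript
// Hypothetical, trimmed execStats from one system.profile document.
const sampleExecStats = {
  stage: "FETCH",
  executionTimeMillisEstimate: 12,
  inputStage: { stage: "IXSCAN", executionTimeMillisEstimate: 3 },
};

// Flatten an execStats tree into one row per stage, outermost stage first.
// Stages with several children use `inputStages` (an array) instead of
// `inputStage`, so both are followed.
function stageTimings(stage, depth = 0, rows = []) {
  if (!stage) return rows;
  rows.push({
    depth,
    stage: stage.stage,
    millis: stage.executionTimeMillisEstimate,
  });
  const children = stage.inputStages || (stage.inputStage ? [stage.inputStage] : []);
  for (const child of children) stageTimings(child, depth + 1, rows);
  return rows;
}
```

The same walk should apply to the stage tree returned by explain(), since it uses the same `stage` / `inputStage` nesting.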

Sample Profiler Command

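db.system.profile.aggregate([
  // Group profiled operations by namespace and operation type.
  { $group: {
      _id: { ns: "$ns", op: "$op" },
      count: { $sum: 1 },            // number of operations
      millis: { $sum: "$millis" }    // total time in milliseconds
  } },
  // Most time-consuming groups first.
  { $sort: { millis: -1 } }
]);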

There’s actually a better way to do this with “top”.

MongoDB’s top() command

• db.adminCommand("top") gives per-collection stats
• These include counts and timings, broken out by dimensions
• The mongotop program polls and sorts into a profile
• https://docs.mongodb.com/manual/reference/command/top/

Benefits
• It’s built-in and always-on.
• It’s low-overhead.
• It’s a great collection profiling data source.
• It’s easy to aggregate.

Drawbacks
• It’s only collection profiling, not query profiling.
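Because `top` reports cumulative counters, a profile for a time window comes from subtracting two snapshots, which is roughly what a polling tool has to do. The sketch below assumes the documented layout (`totals["db.coll"].total` with `time` in microseconds and `count`) and plain JavaScript numbers; 64-bit wrapper types returned by a shell would need converting first. `topDelta` is a hypothetical helper.

```javascript
// Subtract two `top` snapshots and rank collections by time spent in between.
// The "note" entry inside `totals` is a text note, not a collection.
function topDelta(before, after) {
  const rows = [];
  for (const [ns, stats] of Object.entries(after.totals)) {
    if (ns === "note") continue;
    const prev = (before.totals[ns] || {}).total || { time: 0, count: 0 };
    rows.push({
      ns,
      micros: stats.total.time - prev.time,
      count: stats.total.count - prev.count,
    });
  }
  return rows.sort((a, b) => b.micros - a.micros);
}
```

In the shell, `before` and `after` would each be the result of `db.adminCommand("top")`, taken some seconds apart.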

collStats

• If you don’t want to get stats about every single collection, you can query the collection’s stats one by one:
• https://docs.mongodb.com/manual/reference/operator/aggregation/collStats/#latency-stats-document
• You can also get histograms of the latency stats.

db.foo.aggregate([ { $collStats: { latencyStats: { histograms: true } } } ] )
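The latency stats come back as totals, so an average takes one more step. The helper below is a hypothetical sketch that assumes each of `reads`, `writes` and `commands` carries `latency` (total microseconds) and `ops` (operation count), as described in the linked documentation.

```javascript
// Average latency per operation class from a $collStats latencyStats document.
// Assumes `latency` is total microseconds and `ops` is the operation count.
function averageLatencies(latencyStats) {
  const out = {};
  for (const kind of ["reads", "writes", "commands"]) {
    const s = latencyStats[kind];
    if (!s) continue;                          // class not present in this document
    out[kind] = s.ops > 0 ? s.latency / s.ops : 0;
  }
  return out;
}
```

An average hides the spread that users care about, which is why the histogram option is worth a look once a collection stands out.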

MongoDB’s db.serverStatus()

• db.serverStatus() returns a long list of stats/metrics/counters and other information about current server state.
• mongostat displays some of the counters over time.
• Most of this data, plus some OS-level data, is in diagnostic.data
• https://docs.mongodb.com/manual/reference/command/serverStatus/

Benefits
• It’s built-in.
• It has a lot of timing data, not just counters.

Drawbacks
• Much of the data is vanity metrics unless you have a hypothesis it can support or reject.
• It’s global-scoped, not per-collection or per-query-shape etc.

Some Important Parts of serverStatus()

• backgroundFlushing.total_ms (MMAPv1 storage engine only)
• dur.timeMS and its child fields show where time is spent journaling (MMAPv1 storage engine only)
• locks.<type>.timeAcquiringMicros shows queueing
• metrics.getLastError.wtime.totalMillis

MongoDB’s currentOp()

• db.currentOp() shows what’s happening right now.
• It’s analogous to “SHOW PROCESSLIST” in MySQL ;-)

• Repeatedly polling this can give a coarse-grained idea of activity over time, but this is not a best practice.

Benefits
• It’s built-in.
• It has lock wait data.

Drawbacks
• Like slow logs, this ends badly.
• Doesn’t give accurate profiles.
• Misses fast operations.

Warnings

• Ignore Fast Queries At Your Own Risk
  • The default behavior of only measuring slow queries is risky.
  • It ignores early warning signs of soon-to-be-serious problems.
  • By the time a frequent, bad query exceeds 100ms you’re in trouble.

• Fall In Love With Vanity Metrics At Your Own Risk
  • “This counter is large, what does it mean, could it be a problem?”
  • <Googles>
  • “Hmm, maybe I need to increase the widget cache size”
  • <Weeks Pass>
  • Remember the finger pointing at the moon!

MongoDB’s explain()

• You can explain() a query to find out how it might/did execute.
• With the proper verbosity level, you get Profiler-like detail.
• docs.mongodb.com/manual/reference/method/db.collection.explain/

Benefits
• It’s flexible and you can do it individually per-query
• You can get a lot of data about query execution and timings

Drawbacks
• Really only suitable for one-off work; doesn’t help profile a huge set of queries.

Summary of Profiling in MongoDB

Remember: Ideally Profile By Task, Then Drill Down

• The ideal top-level method of profiling is a query profile
  • Aggregates per-query-shape and ranks “hot spot” query shapes at top

• Once you’ve identified query shapes of interest, then drill in
  • Use system.profile data
  • Use explain()
  • Look for per-stage timings
  • As a fall-back, look for operations you know to be expensive (e.g. yields)

MongoDB Has Limited Support For Profiling

• The server’s job is to serve requests, so ideally we’d measure them and profile them.
• Unfortunately, query profiling isn’t really possible in MongoDB.
• Using the built-in functionality, the best bet is collection profiling.
• Find the hot/expensive collections with mongotop.
• Then turn on the profiler for that database and drill into queries.
• This is still a bit tedious and manual.
• MongoDB’s management tools offer some productivity helpers.

Aggregating and Profiling by Query Shape

• Categorize requests by query shape
  • A “query shape” is the abstract/digest of the query, without params/values
  • Aggregate queries together by shape
  • This results in 1 line in the profile per type of query

• Example:
  • db.collection1.find({a: 1}) => db.collection1.find({a: ?})
  • db.collection1.find({a: 2}) => db.collection1.find({a: ?})
  • db.collection1.find({a: 1, b: 2}) => db.collection1.find({a: ?, b: ?})

• Profile by aggregating on the shape’s digest (checksum)
• Pivot the profile to answer different questions
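One way to derive a shape and a digest is sketched below. This is an illustration under assumptions, not how any particular tool does it: `shapeOf`, `shapeDigest` and the choice of a 32-bit FNV-1a checksum are all hypothetical, and sorting keys is a design choice so that `{a, b}` and `{b, a}` group together.

```javascript
// Replace every leaf value with "?" while keeping field names and nesting.
function shapeOf(value) {
  if (Array.isArray(value)) return value.map(shapeOf);
  if (value !== null && typeof value === "object") {
    const shaped = {};
    for (const key of Object.keys(value).sort()) shaped[key] = shapeOf(value[key]);
    return shaped;
  }
  return "?";
}

// 32-bit FNV-1a over UTF-16 code units, used here only as a compact checksum.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

// Digest of namespace + operation + shape, usable as a grouping key.
function shapeDigest(ns, op, filter) {
  return fnv1a(JSON.stringify({ ns, op, shape: shapeOf(filter) }));
}
```

A simplification to be aware of: arrays keep their length, so `$in` lists of different sizes produce different shapes. A production implementation would probably collapse them, and would want a wider hash to make collisions less likely.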

A Sample Query-Level Profile

External Profiling Options

• In addition to getting performance data from inside the server, you can measure it externally.
• TCP packet capture and inspection is best. Three options:
  • mongoreplay: docs.mongodb.com/manual/reference/program/mongoreplay/
  • VividCortex’s free sniffer tool: https://www.vividcortex.com/resources/topic/free-tools
  • VividCortex’s commercial product: https://vividcortex.com

• Disadvantages:
  • Lacks some visibility into server internals that you get with Profiler.
  • May not work on hosted solutions (e.g. ObjectRocket).
  • Ease of use varies.

What About Application or Cluster Profiling?

• In real applications, you need to understand the entire data tier
  • This is infeasible with manual, server-by-server inspection
  • One option is APM tools, or application-level instrumentation
  • The problem is you get a visibility gap.
  • What the app thinks the database is doing is usually badly wrong.
  • “Out-of-band” traffic to the database is vital to measure.
  • (Think Tableau, manual/adhoc queries, backups, cron jobs...)

• Most databases add more instrumentation as they mature; MongoDB is no exception. This is getting easier over time.

Conclusions

• Your servers are for doing useful work (requests), so measure it!
  • Performance is best defined in terms of requests and resources
  • Performance is about latency and throughput
  • Utilization, backlog, etc. are second-order, derived metrics
  • Other things may be “vanity metrics” unless there’s a specific use.

• Get the full picture, but start with throughput and latency
  • Measure every single request, if you can; sampling = bias = trouble
  • Measure every one, but analyze aggregates / populations

• Profiling (aggregating, ranking, drilling down) is essential
  • Profiling by time is 90% of what’s needed 90% of the time
  • There’s some support for this in MongoDB, but not complete

Thanks!

Email me anytime [email protected] me up @xaprb or linkedin.com/in/xaprb

My ebooks on performance theory: vividcortex.com/resources
