node interactive debugging node.js in production

Debugging Node.js in Production Yunong Xiao @yunongx Software Engineer Node Platform

Upload: yunong-xiao

Post on 15-Apr-2017


TRANSCRIPT

Page 1: Node Interactive Debugging Node.js In Production

Debugging Node.js in Production

Yunong Xiao

@yunongx Software Engineer

Node Platform

Page 2: Node Interactive Debugging Node.js In Production

Node.js @ Netflix

❖ 65+ Million Subscribers

❖ Website (netflix.com)

❖ Dynamic asset packager

❖ PaaS on Node

❖ Internal Services

Page 3: Node Interactive Debugging Node.js In Production
Page 4: Node Interactive Debugging Node.js In Production

–Gene Kranz, Flight Director, Apollo 13

“Let's work the problem, people. Let's not make things any worse by guessing”

Page 5: Node Interactive Debugging Node.js In Production

Apply the Scientific Method

1. Construct a Hypothesis

2. Collect data

3. Analyze data and draw a conclusion

4. Repeat

Page 6: Node Interactive Debugging Node.js In Production

Production Crisis

❖ Runtime Performance

❖ Runtime Crashes

❖ Memory Leaks

Page 7: Node Interactive Debugging Node.js In Production

Netflix is “Slow”

Page 8: Node Interactive Debugging Node.js In Production

Gather Request Data

http://restify.com

http://github.com/restify/node-restify

Observable REST Framework

Page 9: Node Interactive Debugging Node.js In Production

restify to the Rescue

[2014-12-09T14:07:26.293Z] INFO: shakti/restify-audit/20067: handled: 200, latency=1402 (req_id=b3fa3820-7fac-11e4-8908-a5c7b70d676f, latency=1435)
    GET / HTTP/1.1
    host: www.netflix.com
    --
    HTTP/1.1 200 OK
    x-netflix.client.instance: i-057e47ef
    x-frame-options: DENY
    content-type: text/html
    --
    req.timers: {
      "parseBody": 700123,
      "apiRPC": 301911,
      "render": 400031
    }

Page 10: Node Interactive Debugging Node.js In Production

req.timers: {
  "parseBody": 700123,
  "apiRPC": 301911,
  "render": 400031
}

On CPU: parseBody and render

Page 11: Node Interactive Debugging Node.js In Production

CPU is Critical

❖ Node is essentially “single threaded”

❖ Cascading effect on ALL requests in process

Page 12: Node Interactive Debugging Node.js In Production

req.timers: {
  "parseBody": 700123,
  "apiRPC": 301911,
  "render": 400031
}

Can’t process ANY other request for 1.1 seconds

On CPU
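The 1.1 second figure can be reproduced from the timer values on the slide. This is a minimal sketch; it assumes the timers are in microseconds (which matches the logged latency of 1402 ms) and that parseBody and render are the on-CPU handlers while apiRPC waits on I/O.

```javascript
// Handler timings from the slide, assumed to be in microseconds.
const timers = { parseBody: 700123, apiRPC: 301911, render: 400031 };

// Assumption: these two handlers run on-CPU; apiRPC is waiting on the network.
const onCpuHandlers = ['parseBody', 'render'];

// Time during which the single JS thread cannot serve any other request.
const onCpuMicros = onCpuHandlers.reduce((sum, name) => sum + timers[name], 0);

// Total handler time, for comparison with the logged request latency.
const totalMicros = Object.values(timers).reduce((sum, value) => sum + value, 0);

console.log(`on-CPU: ${(onCpuMicros / 1e6).toFixed(1)} s`);
console.log(`total:  ${(totalMicros / 1e6).toFixed(1)} s`);
```

The on-CPU share is what matters for other requests in the same process, because time spent waiting on I/O does not block the event loop.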

Page 13: Node Interactive Debugging Node.js In Production

How Much Code?

$ find . -name "*.js*" | xargs cat | wc -l

6 042 301

Page 14: Node Interactive Debugging Node.js In Production

Statistically Sample Stack Traces

Page 15: Node Interactive Debugging Node.js In Production

Snapshot What’s Currently Executing

Stacktrace: A stack trace is a report of the active stack frames at a certain point in time during the execution of a program.

> console.log(ex, ex.stack.split("\n"))
ReferenceError: ex is not defined
    at repl:1:13
    at REPLServer.defaultEval (repl.js:132:27)
    at bound (domain.js:254:14)
    at REPLServer.runBound [as eval] (domain.js:267:12)
    at REPLServer.<anonymous> (repl.js:279:12)
    at REPLServer.emit (events.js:107:17)
    at REPLServer.Interface._onLine (readline.js:214:10)
    at REPLServer.Interface._line (readline.js:553:8)
    at REPLServer.Interface._ttyWrite (readline.js:830:14)
    at ReadStream.onkeypress (readline.js:109:10)
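A stack trace can also be captured from inside a program without an exception being thrown by the runtime. The sketch below uses the standard `Error` object's `stack` property; the function names `captureStack` and `handleRequest` are made up for illustration.

```javascript
// Return the active stack frames at the point of the call, innermost first.
function captureStack() {
  const lines = new Error('snapshot').stack.split('\n');
  // The first line is the error message; the rest are "at ..." frames.
  return lines.slice(1).map((line) => line.trim());
}

// Hypothetical caller, so the trace has more than one application frame.
function handleRequest() {
  return captureStack();
}

const frames = handleRequest();
console.log(frames.slice(0, 2).join('\n'));
```

This captures a single snapshot from inside the process. Sampling many such snapshots from outside the process, without modifying it, is the problem the following slides address.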

Page 16: Node Interactive Debugging Node.js In Production

Two Problems

1) How to sample stack traces from a running process?

2) How to do 1) without affecting the process?

Page 17: Node Interactive Debugging Node.js In Production

Linux Perf Events

PERF(1) perf Manual PERF(1)

NAME perf - Performance analysis tools for Linux

SYNOPSIS perf [--version] [--help] COMMAND [ARGS]

DESCRIPTION Performance counters for Linux are a new kernel-based subsystem that provide a framework for all things performance analysis. It covers hardware level (CPU/PMU, Performance Monitoring Unit) features and software features (software counters, tracepoints) as well.

Page 18: Node Interactive Debugging Node.js In Production

Sample Stack Traces w/ perf(1)

# perf record -F 99 -p `pgrep -n node` -g -- sleep 30
[ perf record: Woken up 2 times to write data ]
[ perf record: Captured and wrote 0.524 MB perf.data (~22912 samples) ]

Page 19: Node Interactive Debugging Node.js In Production

Sample Stack Trace

ab2fee v8::internal::Heap::DeoptMarkedAllocationSites() (/apps/node/bin/
a69754 v8::internal::StackGuard::HandleInterrupts() (/apps/node/bin/node)
c9f13b v8::internal::Runtime_StackGuard(int, v8::internal::Object**
3c793e3060bb (/tmp/perf-5382.map)
3c793e3060bb (/tmp/perf-5382.map)
3c793e3060bb (/tmp/perf-5382.map)
3c793e3060bb (/tmp/perf-5382.map)
(repeated 30 more lines)
8e6b2f v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
8f2281 v8::Function::Call(v8::Local<v8::Value>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
df599a node::MakeCallback(node::Environment*, v8::Local<v8::Value>,...
df5ccb node::CheckImmediate(uv_check_s*) (/apps/node/bin/node)
fb1597 uv__run_check (/apps/node/bin/node)
fabcee uv_run (/apps/node/bin/node)
dfaa50 node::Start(int, char**) (/apps/node/bin/node)
7fcc3ef6876d __libc_start_main (/lib/x86_64-linux-gnu/libc-2.15.so)

Missing JS Frames

Page 20: Node Interactive Debugging Node.js In Production

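Why? V8 compiles JavaScript just in time (JIT), so symbols for JS functions exist only at runtime and perf cannot resolve them from the node binary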

Page 21: Node Interactive Debugging Node.js In Production

node --perf_basic_prof_only_functions

“outputs the files in a format that the existing perf tool can consume.”
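The file in question is the per-process map at /tmp/perf-PID.map, where each line holds a start address, a size (both hexadecimal), and a symbol name. The sketch below shows roughly how a sampled address is turned back into a JS function name; the two map entries and their sizes are made up for illustration, and real tooling (perf itself) does this lookup for you.

```javascript
// Parse perf map text: "<start hex> <size hex> <symbol name>" per line.
function parsePerfMap(text) {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => {
      const [start, size, ...nameParts] = line.split(' ');
      return {
        start: parseInt(start, 16),
        size: parseInt(size, 16),
        name: nameParts.join(' '),
      };
    });
}

// Find the symbol whose address range contains the sampled address.
function resolveAddress(entries, address) {
  const hit = entries.find(
    (entry) => address >= entry.start && address < entry.start + entry.size
  );
  return hit ? hit.name : null;
}

// Hypothetical map contents; start addresses and sizes are illustrative only.
const mapText = [
  '3c793e38b000 200 LazyCompile:DELETE native runtime.js:349',
  '3c793eab1300 100 LazyCompile:processImmediate timers.js:352',
].join('\n');

const entries = parsePerfMap(mapText);
console.log(resolveAddress(entries, 0x3c793e38b0c1));
```

Without the map file, perf sees only raw addresses in JIT-generated code, which is why the JS frames in the earlier trace show no function names.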

Page 22: Node Interactive Debugging Node.js In Production

node --perf_basic_prof_only_functions

Available right now in Node v5.x

Coming soon to Node v4.3: https://github.com/nodejs/node/pull/3609

Page 23: Node Interactive Debugging Node.js In Production

Results

node 5382 cpu-clock:
    3c793e38b0c1 LazyCompile:DELETE native runtime.js:349 (/tmp/perf-5382.map)
    3c793e31981d Builtin:JSConstructStubGeneric (/tmp/perf-5382.map)
    3c793ff2ca94 (/tmp/perf-5382.map)
    3c793e98a10f LazyCompile:~AtlasClient._run /apps/node/webapp/node_modules/nf-atlas-client/lib/client/AtlasClient.js:85 (/tmp/perf-5382.map)
    3c793f47de29 LazyCompile:*AtlasClient.timer /apps/node/webapp/node_modules/nf-atlas-client/lib/client/AtlasClient.js:70 (/tmp/perf-5382.map)
    3c793e9eee38 LazyCompile:~fetchSingleGetCallback /apps/node/webapp/singletons/ShaktiFetcher.js:120 (/tmp/perf-5382.map)
    3c793f6cffee LazyCompile:*Model.get /apps/node/webapp/node_modules/nf-models/lib/Model.js:90 (/tmp/perf-5382.map)
    3c793ed3e2ad (/tmp/perf-5382.map)
    3c7940e4357b Handler:ca (/tmp/perf-5382.map)
    3c793f060e3c Function:~ /apps/node/webapp/node_modules/vasync/lib/vasync.js:134 (/tmp/perf-5382.map)
    3c79404edbfa (/tmp/perf-5382.map)
    3c79401fd3f7 (/tmp/perf-5382.map)
    3c79400e307b LazyCompile:*fetchMulti /apps/node/webapp/singletons/ShaktiFetcher.js:50 (/tmp/perf-5382.map)
    3c793fb9a59f LazyCompile:*fetch /apps/node/webapp/singletons/ShaktiFetcher.js:32 (/tmp/perf-5382.map)
    3c793e896697 (/tmp/perf-5382.map)
    3c7943aaabbe (/tmp/perf-5382.map)
    3c793ef4c53c Function:~ /apps/node/webapp/node_modules/vasync/lib/vasync.js:245 (/tmp/perf-5382.map)
    3c793eaf4f01 LazyCompile:* /apps/node/webapp/node_modules/nf-packager/lib/index.js:194 (/tmp/perf-5382.map)
    3c793eab130a LazyCompile:processImmediate timers.js:352 (/tmp/perf-5382.map)
    3c793e319f7d Builtin:JSEntryTrampoline (/tmp/perf-5382.map)
    3c793e3189e2 Stub:JSEntryStub (/tmp/perf-5382.map)
    a65baf v8::internal::Execution::Call(v8::internal::Isolate*, v8::internal::Handle<v8::internal::Object>, v8::internal::Handle<v8::internal::Object>, int, v8::internal::Handle<v8::internal::Object>*, bool) (/apps/node/bin/node)
    8e6b2f v8::Function::Call(v8::Local<v8::Context>, v8::Local<v8::Value>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
    8f2281 v8::Function::Call(v8::Local<v8::Value>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
    df599a node::MakeCallback(node::Environment*, v8::Local<v8::Value>, v8::Local<v8::Function>, int, v8::Local<v8::Value>*) (/apps/node/bin/node)
    df5ccb node::CheckImmediate(uv_check_s*) (/apps/node/bin/node)
    fb1597 uv__run_check (/apps/node/bin/node)
    fabcee uv_run (/apps/node/bin/node)
    dfaa50 node::Start(int, char**) (/apps/node/bin/node)
    7fcc3ef6876d __libc_start_main (/lib/x86_64-linux-gnu/libc-2.15.so)

JS Frames

Native Frames

Page 24: Node Interactive Debugging Node.js In Production

Problem: Too Many Traces

$ cat out.nodestacks01 | grep cpu-clock | wc -l
744

$ wc -l out.nodestacks01
58116

Page 25: Node Interactive Debugging Node.js In Production

Too Many Traces

Page 26: Node Interactive Debugging Node.js In Production

Solution: Flame Graphs

Page 27: Node Interactive Debugging Node.js In Production

Flamegraph

❖ Each box represents a function in the stack (stack frame)

❖ x-axis: percent of time on CPU

❖ y-axis: stack depth

❖ colors: random, or can be a dimension

❖ https://github.com/brendangregg/FlameGraph

v8

libc

JS

built ins

Page 28: Node Interactive Debugging Node.js In Production

Flame Graph Interpretation

[Figure: example flame graph. Rows from the bottom: a(); b() and h(); c(); d(); e() and f(); g(). i() is in the tower above h().]

Page 29: Node Interactive Debugging Node.js In Production

Flame Graph Interpretation

Top edge shows who is running on-CPU, and how much (width)

[Figure: example flame graph with frames a() through i()]

Page 30: Node Interactive Debugging Node.js In Production

Flame Graph Interpretation

Top-down shows ancestry, e.g., from g()

[Figure: example flame graph with frames a() through i()]

Page 31: Node Interactive Debugging Node.js In Production

Flame Graph Interpretation

[Figure: example flame graph with frames a() through i()]

Widths are proportional to presence in samples

e.g., comparing b() to h() (incl. children)
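The idea that width equals presence in samples can be sketched in a few lines. The sampled stacks below are hypothetical, loosely following the a() through i() example; `foldStacks` produces the "frame;frame;frame count" form that flame graph tooling typically consumes, and `inclusiveShare` gives the fraction of samples in which a frame appears, children included.

```javascript
// Hypothetical sampled stacks, root frame first. Real input would come from perf.
const samples = [
  ['a', 'b', 'c', 'd', 'e'],
  ['a', 'b', 'c', 'd', 'f', 'g'],
  ['a', 'b', 'c', 'd', 'f', 'g'],
  ['a', 'h', 'i'],
];

// Collapse identical stacks into "a;b;c" keys with a sample count each.
function foldStacks(stacks) {
  const folded = new Map();
  for (const stack of stacks) {
    const key = stack.join(';');
    folded.set(key, (folded.get(key) || 0) + 1);
  }
  return folded;
}

// Fraction of samples that contain the frame anywhere in the stack.
// This corresponds to the width of the frame's box, children included.
function inclusiveShare(stacks, frame) {
  const hits = stacks.filter((stack) => stack.includes(frame)).length;
  return hits / stacks.length;
}

const folded = foldStacks(samples);
for (const [stack, count] of folded) {
  console.log(`${stack} ${count}`);
}
console.log('b():', inclusiveShare(samples, 'b'));
console.log('h():', inclusiveShare(samples, 'h'));
```

With these made-up samples, b() is present in three of four stacks and h() in one of four, so b() would be drawn three times as wide as h().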

Page 32: Node Interactive Debugging Node.js In Production
Page 33: Node Interactive Debugging Node.js In Production
Page 34: Node Interactive Debugging Node.js In Production

> 50% time on CPU

Page 35: Node Interactive Debugging Node.js In Production

lodash!

Page 36: Node Interactive Debugging Node.js In Production

function merge(object) { var args = arguments, length = 2;...

Page 37: Node Interactive Debugging Node.js In Production

Use _.assign() Instead
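The slides do not show how merge was being called, so the following only illustrates why the two are not interchangeable in general. A deep merge recurses into nested objects, while an assign copies top-level properties only. `deepMerge` here is a hand-written stand-in, not lodash's implementation, and `Object.assign` stands in for a shallow assign; the sample objects are made up.

```javascript
// Simplified recursive merge: walks every nested plain object in the source.
function deepMerge(target, source) {
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (value !== null && typeof value === 'object' && !Array.isArray(value)) {
      if (target[key] === null || typeof target[key] !== 'object') {
        target[key] = {};
      }
      deepMerge(target[key], value);
    } else {
      target[key] = value;
    }
  }
  return target;
}

// Hypothetical inputs.
const defaults = { retries: 3, headers: { accept: 'text/html' } };
const overrides = { headers: { 'x-frame-options': 'DENY' } };

// Deep merge keeps nested keys from both inputs, at the cost of recursion.
const deep = deepMerge(deepMerge({}, defaults), overrides);

// Shallow assign replaces the whole nested object with the last one seen.
const shallow = Object.assign({}, defaults, overrides);

console.log(Object.keys(deep.headers));
console.log(Object.keys(shallow.headers));
```

The swap is only safe where nested objects do not need to be combined; where that holds, the shallow copy avoids walking the whole object graph on every call.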

Page 38: Node Interactive Debugging Node.js In Production

Before

Page 39: Node Interactive Debugging Node.js In Production

After

Page 40: Node Interactive Debugging Node.js In Production

Flame Graphs

Helps you find 1 LoC out of 6 Million

Page 41: Node Interactive Debugging Node.js In Production

Results

❖ Dramatically reduced request latency

❖ Reduced CPU utilization

❖ Increased throughput

Page 42: Node Interactive Debugging Node.js In Production

Runtime Performance Technique

❖ Sample stack traces via perf(1)

❖ Visualize code distribution with CPU flame graphs

❖ Identify candidate code paths for performance improvement

❖ Repeat

Page 43: Node Interactive Debugging Node.js In Production

Runtime Crashes

Page 44: Node Interactive Debugging Node.js In Production

- Chafin, R. "Pioneer F & G Telemetry and Command Processor Core Dump Program." JPL Technical Report XVI, no. 32-1526 (1971): 174.

“The method described in this article was designed to provide a core dump… with a minimal impact on the spacecraft… as the resumption of data acquisition from the spacecraft is the highest priority.”

Page 45: Node Interactive Debugging Node.js In Production

Core Dumps — A Brief History

❖ Magnetic core memory

❖ Dump out the contents of “core” memory for debugging

❖ “Core dump” was born

❖ Initially printed on paper!

❖ Postmortem debugging was born!

Page 46: Node Interactive Debugging Node.js In Production
Page 47: Node Interactive Debugging Node.js In Production

Production Constraints

❖ Uptime is critical

❖ Not easily reproducible

❖ Can’t simulate environment

❖ Resume normal operations ASAP

Page 48: Node Interactive Debugging Node.js In Production

Postmortem Debugging

1. Take core dump

2. Restart app and continue serving traffic

3. Load core dump elsewhere

4. Engineer debugs and fixes

Page 49: Node Interactive Debugging Node.js In Production

Configure Node to Dump Core on Error

$ node --abort_on_uncaught_exception throw.js
Uncaught Error

FROM
Object.<anonymous> (/Users/yunong/throw.js:1:63)
Module._compile (module.js:435:26)
Object.Module._extensions..js (module.js:442:10)
Module.load (module.js:356:32)
Function.Module._load (module.js:311:12)
Function.Module.runMain (module.js:467:10)
startup (node.js:134:18)
node.js:961:3

[1] 4131 illegal hardware instruction (core dumped) node --abort_on_uncaught_exception throw.js

Page 50: Node Interactive Debugging Node.js In Production

Node Post Mortem Tooling

❖ Netflix uses Linux in Prod

❖ Linux — Work in progress

❖ https://github.com/tjfontaine/lldb-v8

❖ https://github.com/indutny/llnode

❖ Solaris — Full featured, compatible with Linux cores

❖ https://github.com/joyent/mdb_v8

Page 51: Node Interactive Debugging Node.js In Production
Page 52: Node Interactive Debugging Node.js In Production

Socks & Duct Tape: Set Up a Debug Solaris Instance

EC2: http://omnios.omniti.com/wiki.php/Installation#IntheCloud

VM: http://omnios.omniti.com/wiki.php/Installation#Quickstart

Page 53: Node Interactive Debugging Node.js In Production

Post Mortem Methodology

❖ Where: Inspect stack trace

❖ Why: Inspect heap and stack variable state

Page 54: Node Interactive Debugging Node.js In Production

mdb(1) JS commands

❖ ::help <cmd>

❖ ::jsstack

❖ ::jsprint

❖ ::jssource

❖ ::jsconstructor

❖ ::findjsobjects

❖ ::jsfunctions

Page 55: Node Interactive Debugging Node.js In Production

Load the Core Dump

# mdb ./node-v4.2.2-linux/node-v4.2.2-linux-x64/bin/node ./core.7186

> ::load ./mdb_v8_amd64.so
mdb_v8 version: 1.1.1 (release, from 28cedf2)
V8 version: 143.156.132.195
Autoconfigured V8 support from target
C++ symbol demangling enabled

linux node binary

core dump

load mdb_v8 module

Page 56: Node Interactive Debugging Node.js In Production

::jsstack

> ::jsstack
js: test
js: storeHeader
js: <anonymous> (as OutgoingMessage._storeHeader)
js: <anonymous> (as ServerResponse.writeHead)
js: restifyWriteHead
js: _cb
js: send
js: <anonymous> (as <anon>)
js: <anonymous> (as ReactRenderer._renderLayout)
js: <anonymous> (as <anon>)
js: <anonymous> (as <anon>)
js: <anonymous> (as dispatchHandler)
js: <anonymous> (as <anon>)
js: runHooks
js: runTransitionToHooks
js: <anonymous> (as assign.to)
js: <anonymous> (as <anon>)
js: runHooks
js: runTransitionFromHooks
js: <anonymous> (as assign.from)
js: <anonymous> (as React.createClass.statics.dispatch)
native: _ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0...
native: v8::internal::Execution::Call+0xc8
native: v8::internal::Runtime_Apply+0x1ce
js: <anonymous> (as b)

frame type

func name

Page 57: Node Interactive Debugging Node.js In Production

Always name your functions!

var foo = function foo() {};

Foo.prototype.bar = function bar() {};

foo(function bar() {});
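The effect of naming is easy to see in an ordinary stack trace, and the same applies to what a postmortem debugger can show for each frame. A minimal sketch, with made-up function names:

```javascript
// Run fn, and return the innermost stack frame of the error it throws.
function topFrameOf(fn) {
  try {
    fn();
  } catch (err) {
    return err.stack.split('\n')[1].trim();
  }
  return null;
}

// Named function expression: the name shows up in the frame.
const namedFrame = topFrameOf(function parseHeaders() {
  throw new Error('boom');
});

// Anonymous function expression: the frame carries only a location.
const anonymousFrame = topFrameOf(function () {
  throw new Error('boom');
});

console.log(namedFrame);
console.log(anonymousFrame);
```

In a stack dozens of frames deep, a run of anonymous frames like those in the ::jsstack output is much harder to map back to source than a run of named ones.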

Page 58: Node Interactive Debugging Node.js In Production

::jsstack -v Frame Source

> ::jsstack -v
js: storeHeader
    file: http.js
    posn: position 18774
    this: 2ad561306c91 (<unknown>)
    arg1: 3bd67e0669b9 (JSObject: ServerResponse)
    arg2: 3dfe966ae299 (JSObject: Object)
    arg3: 34d5391d8859 (SeqAsciiString)
    arg4: 34d5391d8881 (SeqAsciiString)

    652
    653 function storeHeader(self, state, field, value) {
    654   // Protect against response splitting. The if statement is there to
    655   // minimize the performance impact in the common case.
    656   if (/[\r\n]/.test(value))
    657     value = value.replace(/[\r\n]+[ \t]*/g, '');
    658
    659   state.messageHeader += field + ': ' + value + CRLF;
    660
    661   if (connectionExpression.test(field)) {
    662     state.sentConnectionHeader = true;
    663     if (closeExpression.test(value)) {
    664       self._last = true;
    665     } else {
    666       self.shouldKeepAlive = true;
    667     }
    668
    669   } else if (transferEncodingExpression.test(field)) {

Page 59: Node Interactive Debugging Node.js In Production

::jsstack -vn0 Frame and Function Args

> ::jsstack -vn0
js: test
    file: native regexp.js
    posn: position 2677
    this: 2421205bd4d9 (JSRegExp)
    arg1: 34d5391d8859 (SeqAsciiString)
js: storeHeader
    file: http.js
    posn: position 18774
    this: 2ad561306c91 (<unknown>)
    arg1: 3bd67e0669b9 (JSObject: ServerResponse)
    arg2: 3dfe966ae299 (JSObject: Object)
    arg3: 34d5391d8859 (SeqAsciiString)
    arg4: 34d5391d8881 (SeqAsciiString)
js: <anonymous> (as OutgoingMessage._storeHeader)
    file: http.js
    posn: position 15652
    this: 3bd67e0669b9 (JSObject: ServerResponse)
    arg1: 3dfe966ae271 (ConsString)
    arg2: 3dfe966add99 (JSObject: Object)
js: restifyWriteHead
    file: /apps/node/webapp/node_modules/restify/lib/response.js
    posn: position 6964
    this: 3bd67e0669b9 (JSObject: ServerResponse)
    (1 internal frame elided)
js: _cb
    file: /apps/node/webapp/node_modules/restify/lib/response.js

Func Name

JS File

Line #

Func Args

Page 60: Node Interactive Debugging Node.js In Production

::jsstack Function Args

> ::jsstack -vn0
js: test
    file: native regexp.js
    posn: position 2677
    this: 2421205bd4d9 (JSRegExp)
    arg1: 34d5391d8859 (SeqAsciiString)
js: storeHeader
    file: http.js
    posn: position 18774
    this: 2ad561306c91 (<unknown>)
    arg1: 3bd67e0669b9 (JSObject: ServerResponse)
    arg2: 3dfe966ae299 (JSObject: Object)
    arg3: 34d5391d8859 (SeqAsciiString)
    arg4: 34d5391d8881 (SeqAsciiString)
js: <anonymous> (as OutgoingMessage._storeHeader)
    file: http.js
    posn: position 15652
    this: 3bd67e0669b9 (JSObject: ServerResponse)
    arg1: 3dfe966ae271 (ConsString)
    arg2: 3dfe966add99 (JSObject: Object)
js: restifyWriteHead
    file: /apps/node/webapp/node_modules/restify/lib/response.js
    posn: position 6964
    this: 3bd67e0669b9 (JSObject: ServerResponse)
    (1 internal frame elided)
js: _cb
    file: /apps/node/webapp/node_modules/restify/lib/response.js

Memory Address of Var

Var Type

Page 61: Node Interactive Debugging Node.js In Production

::jsprint Print JS Objects

> 3bd67e0669b9::jsprint
{
    "_time": 1437690472539,
    "_headers": {
        "content-type": "text/html",
        "req_id": "5b7f18f2-7f12-4c68-b07f-3cd75698ba65",
        "set-cookie": "CENSORED; Domain=.netflix.com; Expires=Fri, 24 Jul 2015 10:27:52 GMT",
        "x-frame-options": "DENY",
        "x-ua-compatible": "IE=edge",
        "x-netflix.client.instance": "i-c420596c",
    },
    "output": [],
    "_last": false,
    "_hangupClose": false,
    "_hasBody": true,
    "socket": {
        "_connecting": false,
        "_handle": [...],
        "_readableState": [...],
        "readable": true,
        "domain": null,
        "_events": [...],
        "_maxListeners": 10,
        "_writableState": [...],
        "writable": true,
        "allowHalfOpen": true,
        "onend": function <anonymous> (as socket.onend),

Actual JS Object Instance

Page 62: Node Interactive Debugging Node.js In Production

::jsconstructor Show Object Constructor

> 3bd67e0669b9::jsconstructor -v
ServerResponse (JSFunction: 2421205bced9)

Page 63: Node Interactive Debugging Node.js In Production

::jssource Print f() Source

> 2421205bced9::jssource
file: http.js

 1066 function ServerResponse(req) {
 1067   OutgoingMessage.call(this);
 1068
 1069   if (req.method === 'HEAD') this._hasBody = false;
 1070
 1071   this.sendDate = true;
 1072
 1073   if (req.httpVersionMajor < 1 || req.httpVersionMinor < 1) {
 1074     this.useChunkedEncodingByDefault = chunkExpression.test(req.headers.te);
 1075     this.shouldKeepAlive = false;
 1076   }
 1077 }
 1078 util.inherits(ServerResponse, OutgoingMessage);

Page 64: Node Interactive Debugging Node.js In Production

Core Dump === Complete Process State

Page 65: Node Interactive Debugging Node.js In Production

Memory Leaks

Page 66: Node Interactive Debugging Node.js In Production

Memory Leaks

Page 67: Node Interactive Debugging Node.js In Production
Page 68: Node Interactive Debugging Node.js In Production
Page 69: Node Interactive Debugging Node.js In Production

Generate Core Dump Ad-hoc

Page 70: Node Interactive Debugging Node.js In Production

gcore(1) GNU Tools gcore(1)

NAME gcore - Generate a core file for a running process

SYNOPSIS gcore [-o filename] pid

Page 71: Node Interactive Debugging Node.js In Production

Take a Core Dump!

root@demo:~# gcore `pgrep node`
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7facaeffd700 (LWP 5650)]
[New Thread 0x7facaf7fe700 (LWP 5649)]
[New Thread 0x7facaffff700 (LWP 5648)]
[New Thread 0x7facbc967700 (LWP 5647)]
[New Thread 0x7facbd168700 (LWP 5617)]
[New Thread 0x7facbd969700 (LWP 5616)]
[New Thread 0x7facbe16a700 (LWP 5615)]
[New Thread 0x7facbe96b700 (LWP 5614)]
0x00007facbea5b5a9 in syscall () from /lib/x86_64-linux-gnu/libc.so.6
Saved corefile core.5602

Page 72: Node Interactive Debugging Node.js In Production

Problem: Find Leaking Objects

Page 73: Node Interactive Debugging Node.js In Production

::findjsobjects

NAME findjsobjects - find JavaScript objects

SYNOPSIS [ addr ] ::findjsobjects [-vb] [-r | -c cons | -p prop]

Page 74: Node Interactive Debugging Node.js In Production

::findjsobjects Find ALL JS Objects on Heap

> ::findjsobjects
      OBJECT #OBJECTS   #PROPS CONSTRUCTOR: PROPS
...
3dfe97453121       18     6721 Array
  157a020e01     1304      101 <anonymous> (as Constructor): ...
   8f1a53211    13879       12 ReactDOMComponent: _tag, tagName, props, ...
   8f1a05691    85776        2 Array
3dfe97451a99       36     5589 Array
23e5d7d44351        1   218020 Object: .2f5hpw2hgjk.1.0.3, ...
   8f1a05f31    40533        6 <anonymous> (as ReactElement): type, ...
   8f1a04da1   252133        1 Array
   8f1a04dc1   125869        7 Array
   8f1a04f01   114914        8 Array
   8f1a04d39   230924        7 Module: id, exports, parent, filename, ...

Page 75: Node Interactive Debugging Node.js In Production

Memory Leak Strategy

❖ Look at objects on heap for suspicious objects

❖ Take successive core dumps and compare object counts

❖ Growing object counts are likely leaking

❖ Inspect object for more context

❖ Walk reverse references to find root object

Page 76: Node Interactive Debugging Node.js In Production

Look at Object Delta Between Successive Core Dumps

Page 77: Node Interactive Debugging Node.js In Production

Uptime = 45 mins

> ::findjsobjects
      OBJECT #OBJECTS   #PROPS CONSTRUCTOR: PROPS
...
   8f1a04d39   230924        7 Module: id, exports, parent, filename, ...

Page 78: Node Interactive Debugging Node.js In Production

Uptime = 90 mins

> ::findjsobjects
      OBJECT #OBJECTS   #PROPS CONSTRUCTOR: PROPS
...
   8f1a04d39   323454        7 Module: id, exports, parent, filename, ...
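Comparing two dumps by eye works for one row; for a full listing, the comparison can be scripted. The sketch below assumes the ::findjsobjects output has been saved as text with whitespace-separated columns, as on the slides. The Module rows are taken from the two slides; the ReactDOMComponent row is reused in both dumps purely as a hypothetical unchanged baseline. `parseCounts` and `growth` are illustrative helper names.

```javascript
// Sum object counts per constructor from saved ::findjsobjects output.
function parseCounts(output) {
  const counts = new Map();
  for (const line of output.split('\n')) {
    // Columns: address, #objects, #props, constructor (up to the colon).
    const match = line.trim().match(/^([0-9a-f]+)\s+(\d+)\s+(\d+)\s+([^:]+)/);
    if (!match) continue; // skips the header and "..." lines
    const name = match[4].trim();
    counts.set(name, (counts.get(name) || 0) + Number(match[2]));
  }
  return counts;
}

// List constructors by how much their object count grew between dumps.
function growth(before, after) {
  const rows = [];
  for (const [name, count] of after) {
    rows.push({ name, delta: count - (before.get(name) || 0) });
  }
  return rows.sort((a, b) => b.delta - a.delta);
}

const dumpAt45 = [
  'OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROPS',
  '8f1a53211 13879 12 ReactDOMComponent: _tag, tagName, props, ...', // hypothetical baseline
  '8f1a04d39 230924 7 Module: id, exports, parent, filename, ...',
].join('\n');

const dumpAt90 = [
  'OBJECT #OBJECTS #PROPS CONSTRUCTOR: PROPS',
  '8f1a53211 13879 12 ReactDOMComponent: _tag, tagName, props, ...', // hypothetical baseline
  '8f1a04d39 323454 7 Module: id, exports, parent, filename, ...',
].join('\n');

const report = growth(parseCounts(dumpAt45), parseCounts(dumpAt90));
console.log(report);
```

Constructors at the top of the report are the leak candidates worth inspecting first. Note that rows with the same constructor name (for example several Array rows) are summed here, which is a simplification.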

Page 79: Node Interactive Debugging Node.js In Production

Analyze Leaked Objects

Page 80: Node Interactive Debugging Node.js In Production

Representative Object

> ::findjsobjects
      OBJECT #OBJECTS   #PROPS CONSTRUCTOR: PROPS
...
   8f1a04d39   323454        7 Module: id, exports, parent, filename, ...

Representative Object, 1 of 323454

Page 81: Node Interactive Debugging Node.js In Production

Look Closer

> 8f1a04d39::jsprint
{
    "id": "/apps/node/webapp/ui/js/pages/akiraClient.js",
    "exports": {},
    "parent": {
        "id": "/apps/node/webapp/middleware/autoClientStrings.js",
        "exports": function autoExposeClientStrings,
        "parent": [...],
        "filename": "/apps/node/webapp/middleware/autoClientStrings.js",
        "loaded": true,
        "children": [...],
        "paths": [...],
    },
    "filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",

Page 82: Node Interactive Debugging Node.js In Production

Use ::findjsobjects to Find All “Module” Objects

> 8f1a04d39::findjsobjects
8f1a04d39
3fd996bffb39
3fd996bfcff1
3fd996bfbac1
3fd996bf8a19
3fd996bf7949
3fd996bf3ce9
3fd996bf0f19
3fd996bead71
3fd996bea821
3fd996bea001
3fd996be92b1
3fd996be73d1
3fd996be58d1
3fd996bd88b1
3fd996bcb459
3fd996bcaa41
3fd996bc7009
3fd996bc3321

Page 83: Node Interactive Debugging Node.js In Production

Analyze All 320K+ Objects?

Page 84: Node Interactive Debugging Node.js In Production

Custom Querying With Pipes and Unix Tools

8f1a04d39::findjsobjects | ::jsprint ! grep filename | sort | uniq -c

Page 85: Node Interactive Debugging Node.js In Production

Results

...
      1 "filename": "/apps/node/webapp/ui/js/akira/components/messaging/paymentHold.js",
      2 "filename": "/apps/node/webapp/ui/js/common/commonCore.js",
      1 "filename": "/apps/node/webapp/ui/js/common/playPrediction/playPrediction.js",
      3 "filename": "/apps/node/webapp/ui/js/common/presentationTracking/presentationTracking.js",
 111061 "filename": "/apps/node/webapp/ui/js/common/playPrediction/playPrediction.js",
   7103 "filename": "/apps/node/webapp/ui/js/pages/reactClientRender.js",
 111061 "filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",
 118257 "filename": "/apps/node/webapp/middleware/autoClientStrings.js",
...

Client Side Modules
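For readers less used to the shell idiom, the `grep filename | sort | uniq -c` stage amounts to a tally of matching lines. A rough JavaScript equivalent is sketched below; the input text is a small made-up sample of ::jsprint output, and `countMatchingLines` is an illustrative name.

```javascript
// Count identical lines containing `needle`, sorted by line text,
// which is what `grep needle | sort | uniq -c` produces.
function countMatchingLines(text, needle) {
  const counts = new Map();
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (!trimmed.includes(needle)) continue;
    counts.set(trimmed, (counts.get(trimmed) || 0) + 1);
  }
  return [...counts.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
}

// Hypothetical jsprint output for three Module objects.
const jsprintOutput = [
  '"id": "/apps/node/webapp/ui/js/pages/akiraClient.js",',
  '"filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",',
  '"filename": "/apps/node/webapp/middleware/autoClientStrings.js",',
  '"filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",',
].join('\n');

const tally = countMatchingLines(jsprintOutput, 'filename');
for (const [line, count] of tally) {
  console.log(String(count).padStart(7), line);
}
```

The shell pipeline remains the practical choice inside mdb, since the output of any dcmd can be piped straight to Unix tools with `!`.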

Page 86: Node Interactive Debugging Node.js In Production

What’s holding on to these modules?

Page 87: Node Interactive Debugging Node.js In Production

Aim: Find Root Object

Page 88: Node Interactive Debugging Node.js In Production

Walk Reverse Refs with ::findjsobjects -r

> 8f1a04d39::findjsobjects -r

8f1a04d39 referred to by 14fd6c5b13c1.parent

Page 89: Node Interactive Debugging Node.js In Production

Root Object

> 1f313791bb41::jsprint
[
    {
        "id": "/apps/node/webapp/ui/js/pages/akiraClient.js",
        "exports": [...],
        "parent": [...],
        "filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",
        "loaded": false,
        "children": [...],
        "paths": [...],
    },
    {
        "id": "/apps/node/webapp/ui/js/pages/akiraClient.js",
        "exports": [...],
        "parent": [...],
        "filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",
        "loaded": false,
        "children": [...],
        "paths": [...],
    },
    {
        "id": "/apps/node/webapp/ui/js/pages/akiraClient.js",
        "exports": [...],
        "parent": [...],
        "filename": "/apps/node/webapp/ui/js/pages/akiraClient.js",

Page 90: Node Interactive Debugging Node.js In Production

Spot the Leak

var cache = {};

function checkCache(someModule) {
  var mod = cache[someModule];
  if (!mod) {
    try {
      mod = require(someModule);
      cache[someModule] = mod;
      return mod;
    } catch (e) {
      return {};
    }
  }

  return mod;
}

Module could be client only, must catch

Should cache the fact we caught an exception here

Page 91: Node Interactive Debugging Node.js In Production

Root Cause

❖ Node caches metadata for each module

❖ If require process throws an exception, the module metadata is leaked (bug?)

❖ Client side module meant we were throwing during every request, and not caching the fact we tried to require it

❖ Each request leaks 3+ module metadata objects
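One way to act on the "should cache the fact we caught an exception" note is to remember failed lookups as well as successful ones. This is a sketch of that idea, not the fix Netflix shipped. The loader is injected so the example runs anywhere; in the real application it would be `require`. `createCheckCache` and `fakeLoad` are illustrative names.

```javascript
// Build a checkCache that remembers failures as well as successes,
// so a module that cannot be loaded is only attempted once.
function createCheckCache(load) {
  const cache = new Map();
  return function checkCache(someModule) {
    if (cache.has(someModule)) {
      return cache.get(someModule);
    }
    let mod;
    try {
      mod = load(someModule);
    } catch (e) {
      // Module could be client only: fall back to an empty object.
      mod = {};
    }
    // Cache the outcome either way, including the failure case.
    cache.set(someModule, mod);
    return mod;
  };
}

// Hypothetical loader that always fails, standing in for require()
// on a client-side module.
let attempts = 0;
const fakeLoad = (name) => {
  attempts += 1;
  throw new Error('Cannot load ' + name);
};

const checkCache = createCheckCache(fakeLoad);
const first = checkCache('client-only-module');
const second = checkCache('client-only-module');
console.log('load attempts:', attempts);
```

Unlike the original, which returns a fresh empty object on every failure, this version hands back the same cached object each time, so callers should treat it as read-only.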

Page 92: Node Interactive Debugging Node.js In Production

Memory Leaks

❖ Take successive core dumps (gcore(1))

❖ Compare object counts (::findjsobjects)

❖ Growing objects are likely leaking

❖ Inspect object for more context (::jsprint)

❖ Walk reverse references to find root obj (::findjsobjects -r)

Page 93: Node Interactive Debugging Node.js In Production

Post Mortem Debugging is Critical to Large Scale Prod Node Deployments

Page 94: Node Interactive Debugging Node.js In Production

More State than Just Logs

❖ Detailed stack trace (::jsstack)

❖ Function args for each frame (::jsstack -vn0)

❖ Get state of any object and its provenance (::jsprint, ::jsconstructor)

❖ Get source code of any function (::jssource)

❖ Find arbitrary JS objects (::findjsobjects)

❖ Unmodified Node binary!

Page 95: Node Interactive Debugging Node.js In Production

Production Failures are Inevitable

Page 96: Node Interactive Debugging Node.js In Production

But We Can Learn From Them

Page 97: Node Interactive Debugging Node.js In Production

Production Debugging

❖ Runtime Performance

❖ CPU profiling/flame graphs

❖ Runtime Crashes

❖ Inspect program state with core dumps and mdb

❖ Memory leaks

❖ Analyze objects and references with core dumps and mdb

Page 98: Node Interactive Debugging Node.js In Production

Use the Scientific Method

Page 99: Node Interactive Debugging Node.js In Production

Epilogue — State of Tooling

❖ Join Working Group https://github.com/nodejs/post-mortem

❖ Help make mdb_v8 cross platform https://github.com/joyent/mdb_v8

❖ Contribute to https://github.com/tjfontaine/lldb-v8 and https://github.com/indutny/llnode

Page 100: Node Interactive Debugging Node.js In Production

Acknowledgements

❖ mdb_v8

❖ Dave Pacheco, TJ Fontaine, Julien Gilli, Bryan Cantrill

❖ CPU Profiling/Flamegraphs

❖ Brendan Gregg, Google V8 team, Ali Ijaz Sheikh

❖ Linux Perf

❖ Ingo Molnar, Arnaldo Carvalho de Melo, Namhyung Kim, Jiri Olsa, Peter Zijlstra

❖ lldb-v8

❖ TJ Fontaine

❖ llnode

❖ Fedor Indutny

Page 101: Node Interactive Debugging Node.js In Production

Get Involved!

Page 102: Node Interactive Debugging Node.js In Production

THANKS

❖ Questions? We’re Hiring!

❖ [email protected]

❖ @yunongx

Page 103: Node Interactive Debugging Node.js In Production

Citations

❖ Slides 29-32 used with permission from “Java Mixed-Mode Flame Graphs”, Brendan Gregg, Oct 2015

❖ Slide 26 used with permission from http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html