src: diagnostic NodeReport initial implementation #7242

rnchamberlain · 2016-06-09T10:39:17Z

Checklist

make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
a test and/or benchmark is included
documentation is changed or added
the commit message follows commit guidelines

Affected core subsystem(s)

src

Description of change

Implementation of a diagnostic report for Node, intended for devt,
test and production use, to capture and preserve information for
problem determination. The code sits behind a new option
"--nodereport-events=" which enables selective
triggering of the report on unhandled exceptions, fatal errors and
signals. The report is not enabled by default.

Content of the NodeReport in the initial implementation consists of
a header section containing the event type, timestamp, PID and Node
version, a section containing the Javascript stack trace, a section
containing V8 heap information and an OS platform information
section. Existing V8 APIs are used to obtain the stack trace and V8
heap information. There are changes in node.cc to handle the
command-line option and report triggering, and new files containing
the report writer and testcase.

Candidates for additional content in the NodeReport include: native
stack traces; detailed JS stacktraces with call parameters, stack
locals/references and code; long stack traces; libuv information; OS
and hypervisor information and levels; CPU usage; GC history;
module-specific information (inserted in the report via callback).

More information here:
nodejs/post-mortem#24
https://github.com/rnchamberlain/node/wiki/NodeReport

@nodejs/post-mortem
@bnoordhuis @mhdawson

Implementation of a diagnostic report for Node, intended for devt, test and production use, to capture and preserve information for problem determination. The code sits behind a new option "--nodereport-events=<list of event types>" which enables selective triggering of the report on unhandled exceptions, fatal errors and signals. The report is not enabled by default. Content of the NodeReport in the initial implementation consists of a header section containing the event type, timestamp, PID and Node version, a section containing the Javascript stack trace, a section containing V8 heap information and an OS platform information section. Existing V8 APIs are used to obtain the stack trace and V8 heap information. There are changes in node.cc to handle the command-line option and report triggering, and new files containing the report writer and testcase. Candidates for additional content in the NodeReport include: native stack traces; detailed JS stacktraces with call parameters, stack locals/references and code; long stack traces; libuv information; OS and hypervisor information and levels; CPU usage; GC history; module-specific information (inserted in the report via callback).

addaleax · 2016-06-09T10:45:14Z

Could you run make lint over this? There are quite a few things it points out:
lint-7242.txt

rnchamberlain · 2016-06-09T10:55:43Z

@addaleax darn, of course, will do. thanks.

mscdex · 2016-06-09T11:10:06Z

/cc @nodejs/post-mortem

evanlucas · 2016-06-09T12:02:43Z

node.gyp

@@ -139,6 +139,7 @@
        'src/node_main.cc',
        'src/node_os.cc',
        'src/node_revert.cc',
+        'src/node_report.cc',


can you also add the header file down below? It makes it easier when using Xcode or other IDEs

@evanlucas ah right, will do, thanks

bnoordhuis · 2016-06-09T19:34:38Z

src/node.cc

+  // Check atomic for NodeReport already pending
+  if (__sync_val_compare_and_swap(&nodereport_signal, 0, signo) == 0) {
+    fprintf(stderr,"Signal %s received, triggering NodeReport\n", signo_string(signo));
+    Isolate::GetCurrent()->RequestInterrupt(SignalDumpInterruptCallback, &nodereport_signal);


RequestInterrupt() is not async-signal-safe, it grabs a mutex.

OK, I will need to re-think this signal handler. Maybe hand over to a dedicated thread that's waiting on a semaphore, and call the RequestInterrupt() and uv_async_send() APIs from there.

You can probably piggy-back on the watchdog thread that RegisterDebugSignalHandler() in src/node.cc starts.

misterdjules · 2016-06-10T16:52:46Z

@rnchamberlain In nodejs/post-mortem#24, @davepacheco had asked a few questions that seems to still be relevant. They are quoted below, along with your answers:

I'd suggest parsing the nodereport_events argument more precisely, rather than using strstr(). If other options are added later with overlapping names, the resulting behavior would be surprising.

agree, the strstr() code is temporary

However, it seems srtstr is still used to parse the value(s) of the nodereport_events command line option.

I'm not sure it's a good idea to add a handler for SIGUSR2. How does this change the behavior for existing Node programs that use SIGUSR2?

yes, the signal needs to be configurable, it's likely too late to try to grab/reserve a specific signal for Node diagnostic use

This PR still uses SIGUSR2, so the question still stands.

bnoordhuis · 2016-06-10T16:56:46Z

How does this change the behavior for existing Node programs that use SIGUSR2?

They'll just override the NodeReport signal handler. I don't see a problem with that.

Making the signal number configurable is nevertheless a good idea.

misterdjules · 2016-06-10T18:01:19Z

@rnchamberlain It seems that these reports could be generated with the same info by a native add-on, is that correct? If so, what's the motivation for adding this support to Node's core instead of implementing it as a npm module?

mhdawson · 2016-06-10T20:26:03Z

@misterdjules I'll let Richard give the full answer but I believe at least the part that triggers the generation needs to be in the core and given that providing minimum base information that will be useful in most cases makes sense. You might then want it to be possible to extend whats in the report through modules.

rnchamberlain · 2016-06-13T13:21:25Z

How does this change the behavior for existing Node programs that use SIGUSR2?

As Ben says, existing Node programs that use SIGUSR2 would override the --nodereport handler, so no behaviour change. The PR does address this issue by allowing a choice of the signal to be used to trigger the NodeReport - SIGUSR2 or SIGQUIT. Other signals could be added (note Node currently adds handlers for SIGUSR1 for debug, and SIGINT/TERM for terminal settings).I would expect Node programs that use signals to have some documentation and/or a runtime indication, so a user wanting the report could select a different signal.

...these reports could be generated with the same info by a native add-on

As Michael says, the triggering needs to be in core, and I anticipate that some content would only be available if implemented in core. We have discussed (locally) providing a call-out so that modules could add information to the report. I understand the 'minimal kernel of core functionality' goal, but I think there is enthusiasm for more diagnostic support in Node core, eg #7059

strstr is still used to parse the value(s) of the nodereport_events command line

Yes - does need a more robust parser, e.g to provide messages to the user if there are typos/mistakes in the supplied option, and to allow for future enhancements (event names might overlap). Just this was not a priority for the PR compared with getting the report triggering and basic content available. Will be fixed.

mhdawson · 2016-06-13T15:49:46Z

@rnchamberlain when you say "Will be fixed." do you mean you will fix as part of this PR or in a follow on ? Just wondering if this PR is ready for final review or there are more changes to come ?

davepacheco · 2016-06-13T16:01:31Z

How does this change the behavior for existing Node programs that use SIGUSR2?

As Ben says, existing Node programs that use SIGUSR2 would override the --nodereport handler, so no behaviour change.

Is the assumption that nobody is using SIGUSR2 with the default behavior of terminating the program normally? I could imagine using this as part of a multi-process program, but I don't know of any that do, and I wouldn't figure it would be a great idea anyway. Still, is it worth adding this behavioral change to the release notes?

misterdjules · 2016-06-13T16:09:00Z

@rnchamberlain @mhdawson

the triggering needs to be in core

What specifically needs to be in core? Except for the fatal error events I fail to see why core would need to be changed to support this, but I may very well be missing something.

Fixes for PR review comments: 1. Line length, whitespace, snprintf etc fixes for lint 2. Add .h file to node.gyp 3. Re-write signal handler to hand over to separate thread. The fix shares the debug watchdog thread code. Actually sharing a single watchdog thread would be problematic because on semaphore wake-up the watchdog would need to know which signal was sent, and to allow for a debug signal arriving while a nodereport was in progress. 4. Add a parser for the --nodereport-events option, remove the use of strstr(). 5. Naming and checking improvements suggested by Ben

rnchamberlain · 2016-06-16T12:40:39Z

See fixes for lint, signal handler, option parsing issues etc in Jun 16 commit. The whitespace changes for lint make it quite a large delta, apologies.

@davepacheco
Good point re the current default behaviour on SIGUSR2. In this initial drop we wanted to be able to trigger a NodeReport and have the application continue running. A future enhancement is to allow other diagnostic actions to be configured on the signal. So actions on SIGUSR2 (or other chosen signal) could be: NodeReport + continue/terminate + core dump/no core dump. Possibly also dump trace buffers, heapdump etc. I could bring part of that forward, and make the default for --nodereport-events=sigusr2 be NodeReport + terminate. Either way yes, needs documenting.

@misterdjules
Re core vs module. You are correct that only the fatal error triggering has to be in core (though that is one of the main use-cases). The signal handling could be outside, and I think process.on('uncaughtException..) used for the other trigger. The PR deliberately included a relatively minimal implementation, hopefully we can add other trigger points and more valuable content in the report. I think this is best done when closely integrated with the node runtime, with trigger points and detailed content adjusted with node changes to give the best diagnostic information (rather than have module code digging in from outside).

misterdjules · 2016-06-16T17:15:21Z

@rnchamberlain It seems even the fatal error events could be implemented by setting a custom FatalErrorCallback with v8::SetFatalErrorHandler.

I still fail to see how putting this in core now would help both maintainers and users of these tools. For instance, in theory any change to the format of the output would be a semver-major change, and would be available at the earliest in the next major version of Node.js at the time the changes are merged.

Implementing this as a module would allow maintainers to release new versions quickly and independently from Node.js' release schedule. Users would just need to update their dependency on that module to use newer releases. It seems like an easier workflow both for maintainers of node's core and users/maintainers of these features.

My bias towards not integrating this in core, at least for now, is also reinforced by the fact that it seems the design and implementation of these features are at an early stage and that we should expect a lot of changes both in the implementation and in what users consume.

rnchamberlain · 2016-06-20T08:24:04Z

Right ho, going the npm route, thanks all.

nodejs-github-bot added the c++ Issues and PRs that require attention from people who are familiar with C++. label Jun 9, 2016

evanlucas reviewed Jun 9, 2016
View reviewed changes

mhdawson added the post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. label Jun 9, 2016

bnoordhuis reviewed Jun 9, 2016
View reviewed changes

rnchamberlain closed this Jun 20, 2016

mhdawson mentioned this pull request Jun 28, 2016

Next WG meeting nodejs/post-mortem#31

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

src: diagnostic NodeReport initial implementation #7242

src: diagnostic NodeReport initial implementation #7242

rnchamberlain commented Jun 9, 2016

addaleax commented Jun 9, 2016

rnchamberlain commented Jun 9, 2016

mscdex commented Jun 9, 2016

evanlucas Jun 9, 2016

rnchamberlain Jun 9, 2016

bnoordhuis Jun 9, 2016

rnchamberlain Jun 10, 2016

bnoordhuis Jun 10, 2016

misterdjules commented Jun 10, 2016

bnoordhuis commented Jun 10, 2016

misterdjules commented Jun 10, 2016

mhdawson commented Jun 10, 2016

rnchamberlain commented Jun 13, 2016

mhdawson commented Jun 13, 2016

davepacheco commented Jun 13, 2016

misterdjules commented Jun 13, 2016

rnchamberlain commented Jun 16, 2016

misterdjules commented Jun 16, 2016

rnchamberlain commented Jun 20, 2016

src: diagnostic NodeReport initial implementation #7242

src: diagnostic NodeReport initial implementation #7242

Conversation

rnchamberlain commented Jun 9, 2016

Checklist

Affected core subsystem(s)

Description of change

addaleax commented Jun 9, 2016

rnchamberlain commented Jun 9, 2016

mscdex commented Jun 9, 2016

evanlucas Jun 9, 2016

Choose a reason for hiding this comment

rnchamberlain Jun 9, 2016

Choose a reason for hiding this comment

bnoordhuis Jun 9, 2016

Choose a reason for hiding this comment

rnchamberlain Jun 10, 2016

Choose a reason for hiding this comment

bnoordhuis Jun 10, 2016

Choose a reason for hiding this comment

misterdjules commented Jun 10, 2016

bnoordhuis commented Jun 10, 2016

misterdjules commented Jun 10, 2016

mhdawson commented Jun 10, 2016

rnchamberlain commented Jun 13, 2016

mhdawson commented Jun 13, 2016

davepacheco commented Jun 13, 2016

misterdjules commented Jun 13, 2016

rnchamberlain commented Jun 16, 2016

misterdjules commented Jun 16, 2016

rnchamberlain commented Jun 20, 2016