Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

src: diagnostic NodeReport initial implementation #7242

Closed
wants to merge 2 commits into from
Closed

src: diagnostic NodeReport initial implementation #7242

wants to merge 2 commits into from

Conversation

rnchamberlain
Copy link

Checklist
  • make -j4 test (UNIX) or vcbuild test nosign (Windows) passes
  • a test and/or benchmark is included
  • documentation is changed or added
  • the commit message follows commit guidelines
Affected core subsystem(s)

src

Description of change

Implementation of a diagnostic report for Node, intended for devt,
test and production use, to capture and preserve information for
problem determination. The code sits behind a new option
"--nodereport-events=" which enables selective
triggering of the report on unhandled exceptions, fatal errors and
signals. The report is not enabled by default.

Content of the NodeReport in the initial implementation consists of
a header section containing the event type, timestamp, PID and Node
version, a section containing the Javascript stack trace, a section
containing V8 heap information and an OS platform information
section. Existing V8 APIs are used to obtain the stack trace and V8
heap information. There are changes in node.cc to handle the
command-line option and report triggering, and new files containing
the report writer and testcase.

Candidates for additional content in the NodeReport include: native
stack traces; detailed JS stacktraces with call parameters, stack
locals/references and code; long stack traces; libuv information; OS
and hypervisor information and levels; CPU usage; GC history;
module-specific information (inserted in the report via callback).

More information here:
nodejs/post-mortem#24
https://github.com/rnchamberlain/node/wiki/NodeReport

@nodejs/post-mortem
@bnoordhuis @mhdawson

Implementation of a diagnostic report for Node, intended for devt,
test and production use, to capture and preserve information for
problem determination. The code sits behind a new option
"--nodereport-events=<list of event types>" which enables selective
triggering of the report on unhandled exceptions, fatal errors and
signals. The report is not enabled by default.

Content of the NodeReport in the initial implementation consists of
a header section containing the event type, timestamp, PID and Node
version, a section containing the Javascript stack trace, a section
containing V8 heap information and an OS platform information
section. Existing V8 APIs are used to obtain the stack trace and V8
heap information. There are changes in node.cc to handle the
command-line option and report triggering, and new files containing
the report writer and testcase.

Candidates for additional content in the NodeReport include: native
stack traces; detailed JS stacktraces with call parameters, stack
locals/references and code; long stack traces; libuv information; OS
and hypervisor information and levels; CPU usage; GC history;
module-specific information (inserted in the report via callback).
@nodejs-github-bot nodejs-github-bot added the c++ Issues and PRs that require attention from people who are familiar with C++. label Jun 9, 2016
@addaleax
Copy link
Member

addaleax commented Jun 9, 2016

Could you run make lint over this? There are quite a few things it points out:
lint-7242.txt

@rnchamberlain
Copy link
Author

@addaleax darn, of course, will do. thanks.

@mscdex
Copy link
Contributor

mscdex commented Jun 9, 2016

/cc @nodejs/post-mortem

@@ -139,6 +139,7 @@
'src/node_main.cc',
'src/node_os.cc',
'src/node_revert.cc',
'src/node_report.cc',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you also add the header file down below? It makes it easier when using Xcode or other IDEs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@evanlucas ah right, will do, thanks

@mhdawson mhdawson added the post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. label Jun 9, 2016
// Check atomic for NodeReport already pending
if (__sync_val_compare_and_swap(&nodereport_signal, 0, signo) == 0) {
fprintf(stderr,"Signal %s received, triggering NodeReport\n", signo_string(signo));
Isolate::GetCurrent()->RequestInterrupt(SignalDumpInterruptCallback, &nodereport_signal);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RequestInterrupt() is not async-signal-safe, it grabs a mutex.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I will need to re-think this signal handler. Maybe hand over to a dedicated thread that's waiting on a semaphore, and call the RequestInterrupt() and uv_async_send() APIs from there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably piggy-back on the watchdog thread that RegisterDebugSignalHandler() in src/node.cc starts.

@misterdjules
Copy link

@rnchamberlain In nodejs/post-mortem#24, @davepacheco had asked a few questions that seems to still be relevant. They are quoted below, along with your answers:

I'd suggest parsing the nodereport_events argument more precisely, rather than using strstr(). If other options are added later with overlapping names, the resulting behavior would be surprising.

agree, the strstr() code is temporary

However, it seems srtstr is still used to parse the value(s) of the nodereport_events command line option.

I'm not sure it's a good idea to add a handler for SIGUSR2. How does this change the behavior for existing Node programs that use SIGUSR2?

yes, the signal needs to be configurable, it's likely too late to try to grab/reserve a specific signal for Node diagnostic use

This PR still uses SIGUSR2, so the question still stands.

@bnoordhuis
Copy link
Member

How does this change the behavior for existing Node programs that use SIGUSR2?

They'll just override the NodeReport signal handler. I don't see a problem with that.

Making the signal number configurable is nevertheless a good idea.

@misterdjules
Copy link

@rnchamberlain It seems that these reports could be generated with the same info by a native add-on, is that correct? If so, what's the motivation for adding this support to Node's core instead of implementing it as a npm module?

@mhdawson
Copy link
Member

@misterdjules I'll let Richard give the full answer but I believe at least the part that triggers the generation needs to be in the core and given that providing minimum base information that will be useful in most cases makes sense. You might then want it to be possible to extend whats in the report through modules.

@rnchamberlain
Copy link
Author

How does this change the behavior for existing Node programs that use SIGUSR2?

As Ben says, existing Node programs that use SIGUSR2 would override the --nodereport handler, so no behaviour change. The PR does address this issue by allowing a choice of the signal to be used to trigger the NodeReport - SIGUSR2 or SIGQUIT. Other signals could be added (note Node currently adds handlers for SIGUSR1 for debug, and SIGINT/TERM for terminal settings).I would expect Node programs that use signals to have some documentation and/or a runtime indication, so a user wanting the report could select a different signal.

...these reports could be generated with the same info by a native add-on

As Michael says, the triggering needs to be in core, and I anticipate that some content would only be available if implemented in core. We have discussed (locally) providing a call-out so that modules could add information to the report. I understand the 'minimal kernel of core functionality' goal, but I think there is enthusiasm for more diagnostic support in Node core, eg #7059

strstr is still used to parse the value(s) of the nodereport_events command line

Yes - does need a more robust parser, e.g to provide messages to the user if there are typos/mistakes in the supplied option, and to allow for future enhancements (event names might overlap). Just this was not a priority for the PR compared with getting the report triggering and basic content available. Will be fixed.

@mhdawson
Copy link
Member

@rnchamberlain when you say "Will be fixed." do you mean you will fix as part of this PR or in a follow on ? Just wondering if this PR is ready for final review or there are more changes to come ?

@davepacheco
Copy link
Contributor

How does this change the behavior for existing Node programs that use SIGUSR2?

As Ben says, existing Node programs that use SIGUSR2 would override the --nodereport handler, so no behaviour change.

Is the assumption that nobody is using SIGUSR2 with the default behavior of terminating the program normally? I could imagine using this as part of a multi-process program, but I don't know of any that do, and I wouldn't figure it would be a great idea anyway. Still, is it worth adding this behavioral change to the release notes?

@misterdjules
Copy link

@rnchamberlain @mhdawson

the triggering needs to be in core

What specifically needs to be in core? Except for the fatal error events I fail to see why core would need to be changed to support this, but I may very well be missing something.

Fixes for PR review comments:
1. Line length, whitespace, snprintf etc fixes for lint

2. Add .h file to node.gyp

3. Re-write signal handler to hand over to separate thread. The fix
shares the debug watchdog thread code. Actually sharing a single
watchdog thread would be problematic because on semaphore wake-up
the watchdog would need to know which signal was sent, and to allow
for a debug signal arriving while a nodereport was in progress.

4. Add a parser for the --nodereport-events option, remove the use
of strstr().

5. Naming and checking improvements suggested by Ben
@rnchamberlain
Copy link
Author

See fixes for lint, signal handler, option parsing issues etc in Jun 16 commit. The whitespace changes for lint make it quite a large delta, apologies.

@davepacheco
Good point re the current default behaviour on SIGUSR2. In this initial drop we wanted to be able to trigger a NodeReport and have the application continue running. A future enhancement is to allow other diagnostic actions to be configured on the signal. So actions on SIGUSR2 (or other chosen signal) could be: NodeReport + continue/terminate + core dump/no core dump. Possibly also dump trace buffers, heapdump etc. I could bring part of that forward, and make the default for --nodereport-events=sigusr2 be NodeReport + terminate. Either way yes, needs documenting.

@misterdjules
Re core vs module. You are correct that only the fatal error triggering has to be in core (though that is one of the main use-cases). The signal handling could be outside, and I think process.on('uncaughtException..) used for the other trigger. The PR deliberately included a relatively minimal implementation, hopefully we can add other trigger points and more valuable content in the report. I think this is best done when closely integrated with the node runtime, with trigger points and detailed content adjusted with node changes to give the best diagnostic information (rather than have module code digging in from outside).

@misterdjules
Copy link

@rnchamberlain It seems even the fatal error events could be implemented by setting a custom FatalErrorCallback with v8::SetFatalErrorHandler.

I still fail to see how putting this in core now would help both maintainers and users of these tools. For instance, in theory any change to the format of the output would be a semver-major change, and would be available at the earliest in the next major version of Node.js at the time the changes are merged.

Implementing this as a module would allow maintainers to release new versions quickly and independently from Node.js' release schedule. Users would just need to update their dependency on that module to use newer releases. It seems like an easier workflow both for maintainers of node's core and users/maintainers of these features.

My bias towards not integrating this in core, at least for now, is also reinforced by the fact that it seems the design and implementation of these features are at an early stage and that we should expect a lot of changes both in the implementation and in what users consume.

@rnchamberlain
Copy link
Author

Right ho, going the npm route, thanks all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c++ Issues and PRs that require attention from people who are familiar with C++. post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants