Proposal for diagnostic NodeReport #24

rnchamberlain · 2016-04-15T12:49:47Z

This is a proposal for a human-readable summary dump/report, written to disk on the various failure scenarios. It is intended to preserve information for initial problem determination/triage, and for use
in combination with core dumps when more detailed analysis is needed. Configuration is via command line option(s)

Initial implementation is here:
https://github.com/rnchamberlain/node/compare/master...node-report-3

Documentation and sample output here:
https://github.com/rnchamberlain/node/wiki/NodeReport

Any feedback appreciated!

bnoordhuis · 2016-04-15T15:21:31Z

Diff for easier viewing: https://github.com/rnchamberlain/node/compare/master...node-report-1

davepacheco · 2016-04-15T22:11:08Z

Thanks for this! This looks useful.

For what it's worth, we've been bitten in the past by code that attempts to run after arbitrarily failures like this. For our use cases, I don't think we would typically take on the risk of such code failing or modifying the state in the core file. For the case of --abort-on-uncaught-exception, we really want to execute as little as possible between the failure and the fatal signal handler. It looks like the new reporting behavior is opt-in, so that would just work.

A few questions and comments on the change:

Is the expectation that this behavior would be triggered only by fatal failure, or is it possible for non-fatal failure to result in a report?
Is it expected that this would be merged into V8, or maintained as a separate patch to Node? Do you know if the V8 team is receptive to these changes (e.g., exposing the previously-private HeapStats class)?
Can you explain the macro-assembler-x64.cc change?
I'd suggest parsing the nodereport_events argument more precisely, rather than using strstr(). If other options are added later with overlapping names, the resulting behavior would be surprising.
On that note: I would consider more precise names. What's the difference between "error" and "exception"?
I'm not sure it's a good idea to add a handler for SIGUSR2. How does this change the behavior for existing Node programs that use SIGUSR2?
Do you think the output file should include the pid and timestamp, so that if multiple processes in the same directory crash, they're less likely to clobber each others' reports? I remember the JVM produces similar files in some cases. How are those named?

Out of curiosity, what was the motivation for this change? (We would typically build something that produced a report from the core file instead, but I imagine there may be use cases where that doesn't work?)

Thanks again.

rnchamberlain · 2016-04-18T15:57:17Z

@davepacheco many thanks for the feedback, appreciated.

Yes, we have to be very careful not to make things worse. For javascript problems like unhandled exception or javascript heap out of memory, the process should still be running fine. So as long as we stay in C++ code and don't try to run any javascript we should be OK. For native crashes/out of memory we found we could still do this type of report from the JVM, but we had to be very careful to avoid multiple dumps, potential hangs and secondary crashes (eg avoid locks and malloc, use a SIGSEGV handler and bail out if it triggers). Ideally the core dump is produced first, but that is hard to do on Linux because there is no OS 'dump me' API.

re the questions

not just for fatal failures, other triggers could be the signal, an API call, or a trace point etc
hopefully merged into V8, I think they are receptive to improvements that make life easier for V8 embedders. The internal "HeapStats" interface is a bit strange, it has a set of pointers to integers etc, so quite verbose to use. A better option would be to extend the existing external V8 class "HeapStatistics", if we can do that
the macro-assembler-x64.cc change is a hack to allow the PrintStack call to work in the signal trigger case (it needs the top frame pointer), I am still investigating that
agree, the strstr() code is temporary
agree, and the event names need to make sense to a JS developer. In the prototype, 'exception' means unhandled JS exception and 'error' means internal error, i.e. anything that calls OnFatalError() in Node.
yes, the signal needs to be configurable, it's likely too late to try to grab/reserve a specific signal for Node diagnostic use
yes, pid and timestamp need to be added to the file name. A nice way to do that is as tokens, eg a file name of "NodeReport.%t,%pid.text", as /proc/sys/kernel/core_pattern provides for core dumps (and was done for the JVM dumps)

Motivation is to provide something that users can easily/immediately read. Running a debugger or other tool on a core dump is a bit of a hurdle for some. The core dump is the primary post-mortem diagnostic, this is intended as a useful (optional) extra resource. Experience with the Java equivalent was very positive. There are some things that are easier to do at runtime (eg native/C stack traces with symbols, as the libraries are immediately available)

rnchamberlain · 2016-10-25T09:28:01Z

Renamed node-report is now available as npm module for Node 4 and later, on MacOS/Linux/Win/AIX/SmartOS here: https://www.npmjs.com/package/node-report, github repo here: https://github.com/nodejs/node-report

rnchamberlain mentioned this issue Jun 9, 2016

src: diagnostic NodeReport initial implementation nodejs/node#7242

Closed

4 tasks

rnchamberlain closed this as completed Feb 24, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Proposal for diagnostic NodeReport #24

Proposal for diagnostic NodeReport #24

rnchamberlain commented Apr 15, 2016 •

edited

Loading

bnoordhuis commented Apr 15, 2016

davepacheco commented Apr 15, 2016

rnchamberlain commented Apr 18, 2016

rnchamberlain commented Oct 25, 2016 •

edited

Loading

Proposal for diagnostic NodeReport #24

Proposal for diagnostic NodeReport #24

Comments

rnchamberlain commented Apr 15, 2016 • edited Loading

bnoordhuis commented Apr 15, 2016

davepacheco commented Apr 15, 2016

rnchamberlain commented Apr 18, 2016

rnchamberlain commented Oct 25, 2016 • edited Loading

rnchamberlain commented Apr 15, 2016 •

edited

Loading

rnchamberlain commented Oct 25, 2016 •

edited

Loading