Skip to content
This repository has been archived by the owner on Aug 7, 2023. It is now read-only.

Proposal for diagnostic NodeReport #24

Closed
rnchamberlain opened this issue Apr 15, 2016 · 4 comments
Closed

Proposal for diagnostic NodeReport #24

rnchamberlain opened this issue Apr 15, 2016 · 4 comments

Comments

@rnchamberlain
Copy link
Contributor

rnchamberlain commented Apr 15, 2016

This is a proposal for a human-readable summary dump/report, written to disk on the various failure scenarios. It is intended to preserve information for initial problem determination/triage, and for use
in combination with core dumps when more detailed analysis is needed. Configuration is via command line option(s)

Initial implementation is here:
https://github.com/rnchamberlain/node/compare/master...node-report-3

Documentation and sample output here:
https://github.com/rnchamberlain/node/wiki/NodeReport

Any feedback appreciated!

@bnoordhuis
Copy link
Member

@davepacheco
Copy link

Thanks for this! This looks useful.

For what it's worth, we've been bitten in the past by code that attempts to run after arbitrarily failures like this. For our use cases, I don't think we would typically take on the risk of such code failing or modifying the state in the core file. For the case of --abort-on-uncaught-exception, we really want to execute as little as possible between the failure and the fatal signal handler. It looks like the new reporting behavior is opt-in, so that would just work.

A few questions and comments on the change:

  • Is the expectation that this behavior would be triggered only by fatal failure, or is it possible for non-fatal failure to result in a report?
  • Is it expected that this would be merged into V8, or maintained as a separate patch to Node? Do you know if the V8 team is receptive to these changes (e.g., exposing the previously-private HeapStats class)?
  • Can you explain the macro-assembler-x64.cc change?
  • I'd suggest parsing the nodereport_events argument more precisely, rather than using strstr(). If other options are added later with overlapping names, the resulting behavior would be surprising.
  • On that note: I would consider more precise names. What's the difference between "error" and "exception"?
  • I'm not sure it's a good idea to add a handler for SIGUSR2. How does this change the behavior for existing Node programs that use SIGUSR2?
  • Do you think the output file should include the pid and timestamp, so that if multiple processes in the same directory crash, they're less likely to clobber each others' reports? I remember the JVM produces similar files in some cases. How are those named?

Out of curiosity, what was the motivation for this change? (We would typically build something that produced a report from the core file instead, but I imagine there may be use cases where that doesn't work?)

Thanks again.

@rnchamberlain
Copy link
Contributor Author

@davepacheco many thanks for the feedback, appreciated.

Yes, we have to be very careful not to make things worse. For javascript problems like unhandled exception or javascript heap out of memory, the process should still be running fine. So as long as we stay in C++ code and don't try to run any javascript we should be OK. For native crashes/out of memory we found we could still do this type of report from the JVM, but we had to be very careful to avoid multiple dumps, potential hangs and secondary crashes (eg avoid locks and malloc, use a SIGSEGV handler and bail out if it triggers). Ideally the core dump is produced first, but that is hard to do on Linux because there is no OS 'dump me' API.

re the questions

  • not just for fatal failures, other triggers could be the signal, an API call, or a trace point etc
  • hopefully merged into V8, I think they are receptive to improvements that make life easier for V8 embedders. The internal "HeapStats" interface is a bit strange, it has a set of pointers to integers etc, so quite verbose to use. A better option would be to extend the existing external V8 class "HeapStatistics", if we can do that
  • the macro-assembler-x64.cc change is a hack to allow the PrintStack call to work in the signal trigger case (it needs the top frame pointer), I am still investigating that
  • agree, the strstr() code is temporary
  • agree, and the event names need to make sense to a JS developer. In the prototype, 'exception' means unhandled JS exception and 'error' means internal error, i.e. anything that calls OnFatalError() in Node.
  • yes, the signal needs to be configurable, it's likely too late to try to grab/reserve a specific signal for Node diagnostic use
  • yes, pid and timestamp need to be added to the file name. A nice way to do that is as tokens, eg a file name of "NodeReport.%t,%pid.text", as /proc/sys/kernel/core_pattern provides for core dumps (and was done for the JVM dumps)

Motivation is to provide something that users can easily/immediately read. Running a debugger or other tool on a core dump is a bit of a hurdle for some. The core dump is the primary post-mortem diagnostic, this is intended as a useful (optional) extra resource. Experience with the Java equivalent was very positive. There are some things that are easier to do at runtime (eg native/C stack traces with symbols, as the libraries are immediately available)

@rnchamberlain
Copy link
Contributor Author

rnchamberlain commented Oct 25, 2016

Renamed node-report is now available as npm module for Node 4 and later, on MacOS/Linux/Win/AIX/SmartOS here: https://www.npmjs.com/package/node-report, github repo here: https://github.com/nodejs/node-report

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants