
ApplicationInsights crashes Node server with "RangeError: Maximum call stack size exceeded" #469

Closed
madyke opened this issue Dec 6, 2018 · 23 comments

@madyke

madyke commented Dec 6, 2018

Hello,

We're seeing an issue where our Node server is crashing when we have AppInsights configured. When the server starts it can handle requests fine, but if we just let it run for about 30 minutes, it'll crash and give us this error message:

D:\repos\DemoPartner\build\server.js:148960
                return value.apply(this, [error, structuredStackTrace]);
                             ^

RangeError: Maximum call stack size exceeded

The stack trace points to what looks like zone.js code, which is a dependency of AppInsights. We confirmed that completely removing AppInsights fixes the problem.

I thought it might be related to #294, but we're still seeing the issue after setting setAutoDependencyCorrelation(false) when configuring AppInsights.
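For reference, disabling automatic dependency correlation during setup looks roughly like the sketch below (a minimal example, not our actual server code; it assumes the instrumentation key comes from the environment):

    // Minimal sketch of the configuration described above (illustrative only).
    const appInsights = require('applicationinsights');
    appInsights
        .setup()                              // assumes APPINSIGHTS_INSTRUMENTATIONKEY is set in the environment
        .setAutoDependencyCorrelation(false)  // the workaround referenced above
        .start();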

Has anyone else seen this error, or have any suggestions?

Thanks,
Matt Dyke

@OsvaldoRosado
Member

As a first thing to try, what happens when you add process.env['APPLICATION_INSIGHTS_NO_DIAGNOSTIC_CHANNEL'] = true; as the first line of your first JS file, rather than disabling correlation?
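To be clear about placement, the idea is roughly this (a sketch assuming a single entry-point file; the setup call is only shown for context):

    // Must run before applicationinsights (or anything it patches) is loaded,
    // so it goes at the very top of the entry point, ahead of any require.
    process.env['APPLICATION_INSIGHTS_NO_DIAGNOSTIC_CHANNEL'] = true;

    const appInsights = require('applicationinsights');
    appInsights.setup().start();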

@madyke
Author

madyke commented Dec 7, 2018

Unfortunately, it crashed with the same error.

@OsvaldoRosado
Member

Thanks! When you previously disabled automatic dependency correlation, did the stack trace change? If it still contained zone.js methods, it would be a sign that it wasn't disabled correctly.

@madyke
Author

madyke commented Dec 7, 2018

We aren't actually getting any stack traces when the server crashes; the only output is what I included in my first post. I'm working with our build team to see if there's a stack trace that isn't being output correctly, so we can investigate based on that.

I also noticed that even though I inserted the code you mentioned at the top of our first JS file, after the build process it actually ends up about 20K lines deep in the bundled server code. So I'm also working with the build team to make sure that line ends up at the top of the server file.

I'll update here once I get more definitive answers to your questions from our build team.

@christianrondeau

I had the same issue (stack size exceeded some time after launching my app, with no useful stack trace). Using process.env['APPLICATION_INSIGHTS_NO_DIAGNOSTIC_CHANNEL'] = true; let me see the actual error.

Hopefully this will help get to the root cause, and keep AI from hiding the real error when it crashes!

@christianrondeau

So, I'm still running with process.env['APPLICATION_INSIGHTS_NO_DIAGNOSTIC_CHANNEL'] = true; but the crash still happens (and startup is really long). The stack trace does not point directly to Application Insights, but it looks the same as the OP's, and as far as I can tell I didn't have these errors before:

...\node_modules\zone.js\dist\zone-node.js:900
    return value.apply(this, [error, structuredStackTrace]);
                        ^

RangeError: Maximum call stack size exceeded

Not sure if it's related, but to make sure all the information is here: I only see this crash while running locally, never on the server (or the health check silently reboots it and I haven't noticed). Locally I'm running with nodemon and ts-node from the console; in production I'm running the transpiled JavaScript directly under iisnode.

I'll still try to get more information, but it's hard, since it only fails "sometimes", even when doing the exact same steps. Hopefully someone else can point out similarities or differences with my case.

@madyke
Author

madyke commented Dec 24, 2018

Thanks for chiming in - interesting to see others with similar issues, since I had begun to suspect it was being caused by something strange specific to our build process rather than an actual issue with AppInsights.

Most of our build team is out for the rest of the year for the holidays; I'll update here once people are back in the office and I can get answers to Osvaldo's previous questions.

@srowan

srowan commented Dec 24, 2018

Also seeing this issue on a fairly large Node application that I recently added AI to. Not sure how to investigate...

@christianrondeau

This may not be much help, but I believe it might be related in some way to TypeScript.

I tried starting the server with the debugger, waited for a few dozen seconds, and then enabled breaking on "all exceptions".

Here's what I found out:

NativeError.prepareStackTrace (c:\Dev\mediaclip-hub-graphql\node_modules\zone.js\dist\zone-node.js:891)
AppInsightsAsyncCorrelatedErrorWrapper.orig.prepareStackTrace (c:\Dev\mediaclip-hub-graphql\node_modules\applicationinsights\out\AutoCollection\CorrelationContextManager.js:178)
NativeError.prepareStackTrace (c:\Dev\mediaclip-hub-graphql\node_modules\zone.js\dist\zone-node.js:900)

These frames are called in a never-ending loop. The error is an ENOENT for some deep file in the ts-node compilation, and that causes CorrelationContextManager.patchError to call itself through zone-node.ts.

I'm not sure how to replicate this easily; I think ts-node tries to load a file that does not exist (which is fine on its own) and Application Insights then goes crazy.
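As a minimal sketch of the failure mode those frames suggest (illustrative only, not the actual zone.js or SDK code): if each library's Error.prepareStackTrace wrapper ends up treating the other wrapper as its "original", formatting a single stack never terminates.

    // Hypothetical reconstruction of the loop: two prepareStackTrace-style
    // wrappers that each delegate to the other as their "original".
    const zoneLikePrepare = (error, frames) => sdkLikePrepare(error, frames);
    const sdkLikePrepare = (error, frames) => zoneLikePrepare(error, frames);

    try {
        zoneLikePrepare(new Error('boom'), []);
    } catch (e) {
        console.log(e instanceof RangeError); // true: "Maximum call stack size exceeded"
    }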

@OsvaldoRosado
Member

Thanks for the additional analysis @christianrondeau!

I'm hopeful that #474 will be able to resolve this by removing CorrelationContextManager.patchError entirely.

@madyke
Author

madyke commented Jan 8, 2019

Thanks for the help everyone. I synced with our build team after the holidays, and have now been able to properly implement the original suggestion of setting setAutoDependencyCorrelation(false) when configuring AppInsights.

This seems to have resolved the immediate issue of the Node server crashing (it has been running locally and successfully serving pages for ~16 hours now; previously it crashed after 30 minutes or so), but implementing this workaround has broken the ability to correlate server trace and event logs (made with a .trackEvent() or .trackTrace() call) with the original request. It looks like those events and traces no longer have an OperationID.

@OsvaldoRosado - would you have any suggestions on how to fix the broken correlation between the request and event/trace logs after setting setAutoDependencyCorrelation(false)? Or alternative workarounds that don't impact correlation in the first place? Also, do you have an ETA on when #474 will be checked in and released?

Thanks!

@OsvaldoRosado
Member

@madyke In order to have correlation with setAutoDependencyCorrelation disabled, you must propagate this information in code yourself. If you don't use automatic telemetry collection, you can add this information using the tagOverrides argument in your track calls. Otherwise, you'll need to write a telemetry processor that can handle this (this is usually quite non-trivial).
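A rough sketch of the tagOverrides approach (assuming the 1.x Node SDK; the request handler, header name, and event names here are made up for illustration):

    // Assumes appInsights.setup(...) has already been called elsewhere.
    const appInsights = require('applicationinsights');
    const client = appInsights.defaultClient;
    const operationIdKey = client.context.keys.operationId; // the "ai.operation.id" tag

    function handleRequest(req, res) {
        // Reuse an incoming Request-Id header, or mint your own operation id.
        const operationId = req.headers['request-id'] || 'op-' + Date.now();

        client.trackEvent({
            name: 'CheckoutStarted',
            tagOverrides: { [operationIdKey]: operationId }
        });
        client.trackTrace({
            message: 'processing checkout',
            tagOverrides: { [operationIdKey]: operationId }
        });
        res.end('ok');
    }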

An alternative workaround might be to replace the underlying implementation for automatic correlation. This is similar to what #474 attempts to accomplish within the SDK itself - but monkeypatched from your application's end. See #296 for an example where someone has done exactly this.

@markwolff is actively working on #474 and might have an ETA.

@madyke
Author

madyke commented Jan 8, 2019

Thanks Osvaldo, I'll investigate to see if any of those suggestions will work for us.

Do you know more about what the actual underlying issue is, and why setting .setAutoDependencyCorrelation(false) seems to resolve the problem? If disabling auto dependency correlation fixed the issue, would doing the correlation manually by implementing your suggestions bring the issue back?

@OsvaldoRosado
Member

@madyke
setAutoDependencyCorrelation, when enabled (the default), instructs the SDK to use the zone.js package to establish a context-propagation mechanism throughout your application. Because Node uses a single-threaded event loop model, establishing context propagation so that we know an async callback was called in the frame of a certain request handler is very non-trivial, and essentially requires monkeypatching every method that can go asynchronous to remember the state before the async boundary and restore it afterwards.

As such, zone.js is an incredibly invasive library, and small bugs in it can cause hard-to-diagnose errors like this. There are also the changes we make to zone.js within patchError that might cause similar errors as well.

By disabling this, and manually passing around what the current operation should be without any "magic", you avoid all of these problems - at the cost of lots of annoying code to write 😄 Thus, manual implementations should not have bugs like this and are conceptually much simpler.

#474 tries to get the best of both worlds: magic propagation with significantly less invasive runtime changes. It does this by taking advantage of APIs in newer versions of Node that address the context propagation problem from within Node itself.
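As a rough illustration of what that kind of approach looks like (a sketch using Node's async_hooks API, available in Node 8+; the actual implementation in #474 may differ):

    // Track a per-request context across async boundaries using async_hooks,
    // instead of monkeypatching user-facing APIs.
    const asyncHooks = require('async_hooks');

    const contexts = new Map(); // asyncId -> context object

    asyncHooks.createHook({
        init(asyncId, type, triggerAsyncId) {
            // New async resources inherit the context of whatever created them.
            if (contexts.has(triggerAsyncId)) {
                contexts.set(asyncId, contexts.get(triggerAsyncId));
            }
        },
        destroy(asyncId) {
            contexts.delete(asyncId);
        }
    }).enable();

    function runWithContext(context, fn) {
        contexts.set(asyncHooks.executionAsyncId(), context);
        fn();
    }

    function currentContext() {
        return contexts.get(asyncHooks.executionAsyncId());
    }

    // Anything scheduled inside the callback sees the same context.
    runWithContext({ operationId: 'op-123' }, () => {
        setTimeout(() => console.log(currentContext()), 10); // { operationId: 'op-123' }
    });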

Hope this explanation helps! I know it glosses over a lot of details. Feel free to ask if you'd like more information about something specific.

@madyke
Author

madyke commented Jan 9, 2019

Thanks for the breakdown, Osvaldo; it's very helpful. I'm still discussing with some other team members how we want to proceed. If #474 is expected to fix the issue, then we'd rather not do a lot of work to implement a hacky fix on our end, all of which would be throwaway effort once #474 is in.

That said, we need AppInsights correlating our telemetry before our project goes into private preview on Jan. 24th. @markwolff / @OsvaldoRosado - is there any chance this fix will be checked in soon? I'm not familiar with your release cadence; once the change is checked in, what's the process for getting it released so we can use it?

@markwolff
Contributor

@madyke It seems #474 is passing tests and will be ready to go after addressing a few remaining items. There is nothing else planned to be tied in with the next release. I'm probably going to release it as a beta first, so it'd be awesome if you could test it out and let me know if it fixes all of your problems. I'll let you know when it's published!

@christianrondeau

christianrondeau commented Jan 13, 2019

That's 1.1.0-beta.0, right? Since it's not happening reliably it's hard to tell if it's fully fixed, but I haven't been able to replicate the infinite loop yet! I'll report back if I'm still having issues, but so far it's looking good!

@markwolff
Contributor

@christianrondeau Yes, 1.1.0-beta.0 is the version that #474 was published under. Glad to hear there are no issues so far; please let me know if you run into any!

@christianrondeau

Ok, I think I can confirm the issue is solved. I had a startup exception (which previously always ended up in the infinite loop "the first time"), and instead I now get:

...\node_modules\applicationinsights\out\AutoCollection\Exceptions.js:34
                        throw THIS_IS_APPLICATION_INSIGHTS_RETHROWING_YOUR_EXCEPTION; // Error originated somewhere else in your app

It got me confused for a second, but I understand why you did this weird throw line :) So as far as I can tell this is "normal behavior", and the infinite loop is not happening anymore, even in cases where it reliably happened on my end before.

@markwolff
Contributor

@christianrondeau Good to hear. I'm assuming the startup exception is fixable on your end? Or is it on the SDK end?

I'll close this issue for now as it seems to be fixed. Please reopen if this is not the case.

@madyke
Author

madyke commented Jan 14, 2019

Hi everyone - it seems like this release has solved the issue for us as well. I was able to leave our server running over the entire weekend and it is still alive and serving requests so far without any sign of the previous errors.

If I do start seeing the issue again I'll update here, but for now it looks like this is fixed. Thanks everyone for your help!

@christianrondeau

To answer your question @markwolff: the error itself was mine and expected (it was an early GraphQL precompilation error in my code, which usually triggered the infinite loop), so it's unrelated to this issue. Thanks a lot for following up; I'll use the beta until the official release is published!

@markwolff
Contributor

@christianrondeau I forgot to update this thread to mention that the stable version was published some time ago. Latest is now 1.2.0.
