Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node 8 / Async: Auto-Correlation support #381

Closed
GavinOsborn opened this issue Apr 3, 2018 · 19 comments
Closed

Node 8 / Async: Auto-Correlation support #381

GavinOsborn opened this issue Apr 3, 2018 · 19 comments

Comments

@GavinOsborn
Copy link

TL/DR
Auto dependency correlation appears to fail in all but the most trivial async/await scenarios.
I understand that you can't always rely on context propagation in nodejs but this seems too simple to be expected.

In more detail
Everything I'm about to say can be reproduced from this gist.

Recently I converted an existing application from a node 6 web API to node 8 and began updating the code-base to leverage async/await functionality. This application integrates with various Azure PaaS systems including Cosmos DB and Azure Storage. Everything was working great until we checked our application insights integration. I noticed that all of the data was there but all of a sudden our tracing and dependency correlation was broken.

Checking the issues page I thought it may be related to #296 (our application is quite complex) so I tried to boil it down to the simplest of repro steps - and it really was simple.

Looking at the gist you'll see that executing 1 async operation generates perfectly good dependency correlation
curl http://localhost:3000/endpoint/simple

Type Operation Id Request Url
RemoteDependencyData 987D4A34-7228-428F-B366-C007EA54C8D8 987D4A34-7228-428F-B366-C007EA54C8D8.1 https://github.com:443/Microsoft/ApplicationInsights-node.js/
RequestData 987D4A34-7228-428F-B366-C007EA54C8D8 987D4A34-7228-428F-B366-C007EA54C8D8. http://localhost:3000/endpoint/simple

But executing a series of identical async operations will drop the context half-way through the request
curl http://localhost:3000/endpoint/complex

Type Operation Id Request Url
RemoteDependencyData 2199F028-C94E-4A61-BCE3-9C019390A632 2199F028-C94E-4A61-BCE3-9C019390A632.1. https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData undefined 2AA17EDD-073D-4538-9D87-4920E52FC1A7 https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData undefined 19AC35E4-50F7-4091-B19D-8B5205559D0C https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData undefined 7B41AD91-91B4-47DD-9D30-CE536E81953A https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData undefined 032F748C-B57C-4F79-A187-F954166FD1B9 https://github.com:443/Microsoft/ApplicationInsights-node.js/
RequestData 2199F028-C94E-4A61-BCE3-9C019390A632 2199F028-C94E-4A61-BCE3-9C019390A632. http://localhost:3000/endpoint/complex

But switching this to either a promise chain or a generator based solution produces perfect correlation
curl http://localhost:3000/endpoint/chained

Type Operation Id Request Url
RemoteDependencyData DC7B1746-5ED7-40AB-8557-31C71BF1132A DC7B1746-5ED7-40AB-8557-31C71BF1132A.1. https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData DC7B1746-5ED7-40AB-8557-31C71BF1132A DC7B1746-5ED7-40AB-8557-31C71BF1132A.2. https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData DC7B1746-5ED7-40AB-8557-31C71BF1132A DC7B1746-5ED7-40AB-8557-31C71BF1132A.3. https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData DC7B1746-5ED7-40AB-8557-31C71BF1132A DC7B1746-5ED7-40AB-8557-31C71BF1132A.4. https://github.com:443/Microsoft/ApplicationInsights-node.js/
RemoteDependencyData DC7B1746-5ED7-40AB-8557-31C71BF1132A DC7B1746-5ED7-40AB-8557-31C71BF1132A.5. https://github.com:443/Microsoft/ApplicationInsights-node.js/
RequestData DC7B1746-5ED7-40AB-8557-31C71BF1132A DC7B1746-5ED7-40AB-8557-31C71BF1132A. http://localhost:3000/endpoint/chained

Environment:

  • OS: Windows
  • Node Version: 8.11.0
  • 3rd party libraries: none (beyond the chosen web framework)

I understand that you can't always rely on context propagation in nodejs but this seems too simple to be expected.

Am I doing something wrong?
Is this the same issue as #296?

Please advise.

@OsvaldoRosado
Copy link
Member

OsvaldoRosado commented Apr 3, 2018

I don't believe this is the same as #296 but something deeper. It's my current understanding that we don't have the proper integration points with our monkeypatching approach to apply context propagation across native async /await (vs say async/await transpiled into Promises). In newer versions of node the experimental async_hooks API may be providing the necessary information for us to propagate context, but we don't yet support this (and I believe it has performance implications).

Some things that I think are related:

@GavinOsborn
Copy link
Author

GavinOsborn commented Apr 3, 2018

Thanks @OsvaldoRosado.

Reading angular/zone.js#715 I agree that this does sound v.similar.
The problem with waiting for async_hooks of course is that we are basically admitting we don't have a native solution for async/await for Node 8 (currently LTS) and none can be expected in LTS until Node 10 (best case scenario).

If this limitation is something that you are choosing to adopt (rather than considering alternatives) then I really feel this needs to be quite explicitly documented - including alternatives / workarounds.

Speaking of workarounds, are you able to offer any suggestions for how we might move forward in the interim?

  • Transpilation with Babel
  • avoid native async/await
  • ...?

Alternative
Apologies if this is a naive question - I don't know much about the internals of your library but...

Have you considered continuation-local-storage as opposed to Zones for context propogation?
I have no experience with Zones but I've been successfully using cls for several years including native async/await and I have not run into these problems. I notice @tirish utilized it in his their workaround for #296 also.

One obvious drawback is a lack of browser support. But would it be possible to plug-in/configure your context propogation technology of choice?

@OsvaldoRosado
Copy link
Member

OsvaldoRosado commented Apr 3, 2018

Suggestions:
Transpilation or otherwise avoiding native async/await will probably get you working the fastest. However, we do also provide a code-based solution rather than a build-based solution.

The appInsights.wrapWithCorrelationContext function can be used to hook into the SDK's internal context manager and fix broken correlation. So you could do code like this (though admittedly it's bringing you back to callback-style):

let nextStuff = appInsights.wrapWithCorrelationContext(() => {
  handleResponse(); // Context here is equal to what it was when let was defined
});
await asyncCall();
nextStuff();

Are we waiting for async_hooks?:
I believe async_hooks is the only API providing sufficient data for context preservation for async/await. Assuming this is the case, it must be involved in any real solution we come up with here. I imagine continuation-local-storage must be using this experimental API internally to work in this scenario. We can of course choose to depend on it while it's still experimental, and might get this just by updating our zone.js dependency.

Clearly documenting some of the known limitations we have because we don't yet have an implementation that takes advantage of async_hooks sounds totally fair to me.

Why zone.js over cls?:
We considered both (as well as manual implementations over async_hooks and a few other APIs) when developing the correlation feature. We found that zones needed the fewest additional monkeypatches to work correctly in some real-world services we instrumented prerelease versions of this SDK with.

Browser support isn't a main concern for us, since this SDK is intended for server-side node vs our separate client-side js package. If the compatibility landscape has changed between zone and cls we could consider changing our underlying context propagation.

Having support for dropping in your own context propagation sounds like a great idea to me, since this is something that will definitely vary from app-to-app. Our library for managing context and collection patches for third-party modules like Redis and Mongo (https://github.com/Microsoft/node-diagnostic-channel/) already supports this, so it might not be much effort to expose it through this SDK as well.

@GavinOsborn
Copy link
Author

Yeah, appInsights.wrapWithCorrelationContext isn't a great option for this particular problem but I can see how it might be useful. I'm not a fan of solutions that rely on developer discipline for mission critical features like instrumentation and support-ability (is that a word?) of our apps.

I think the zone/cls debate is fairly moot if they would both depend upon async_hooks
Though I suppose having the option to switch to an experimental API for context propagation would still be nice - I guess.

I'm currently leaning towards a Node6/8 implementation with co/generators until native async is fully supported. But that's a conversation for my team.

Linting checks for async/await should be an easy build time solution.

@janis91
Copy link

janis91 commented Jul 13, 2018

Hi, as this is already an issue that is a bit older, I wanted to ask, if there were some steps forward with this or if you can propose a workaround. In our current code we use TypeScript and async/await, the compile target is ES6, but also with ES5 it isn't working.

What we do is just a REST request to get some additional information. All other traces (logs from winston) are tracked within the right context of the incoming request, but our request to the other service is without operation id. I guess therefore this is the same issue.

What I also tried is the usage of wrapWithCorrelationContext(). Unfortunately the same result.

Is there any other way you can think of, that I could try beside avoiding async/await?

btw. we are using the "request" library (like almost everyone I guess :-D )

@OsvaldoRosado
Copy link
Member

@janis91 No steps forward for now other than to avoid async/await (assuming you are being impacted by this issue and not something else). When your compile target is set to ES5 is typescript converting your async/await to promises/generators in the resulting JS?

@OsvaldoRosado
Copy link
Member

@janis91 Actually from

Our request to the other service is without operation id. I guess therefore this is the same issue.

I think you might be hitting a different issue. Does this seem to be #396 ? If this problem is with correlation between services (requests should be associated with dependencies that caused them) rather than correlation of telemetry within a service (dependencies should be associated with requests that caused them) this might be the case

@janis91
Copy link

janis91 commented Jul 16, 2018

@OsvaldoRosado

When your compile target is set to ES5 is typescript converting your async/await to promises/generators in the resulting JS?

It is converting it to generators, if I'm right:

"use strict";
var __awaiter = (this && this.__awaiter) || function (thisArg, _arguments, P, generator) {
    return new (P || (P = Promise))(function (resolve, reject) {
        function fulfilled(value) { try { step(generator.next(value)); } catch (e) { reject(e); } }
        function rejected(value) { try { step(generator["throw"](value)); } catch (e) { reject(e); } }
        function step(result) { result.done ? resolve(result.value) : new P(function (resolve) { resolve(result.value); }).then(fulfilled, rejected); }
        step((generator = generator.apply(thisArg, _arguments || [])).next());
    });
};
function someAsyncFunction() {
    return __awaiter(this, void 0, void 0, function* () {
        someAwaitedFunction();
    });
}

I think you might be hitting a different issue. Does this seem to be #396 ?

I think it's most likely this issuer here. I try to explain again:

  1. Request from user hits our service (AI sets an operation id)
  2. Authorization takes place by calling redis and asking for a cached result (the redis call is correlated correctly - same operation id)
  3. There is no authorization response cached yet.
  4. Our service calls (request library) the authorization server and asks for authorization (the request will be sent to AI but WITHOUT operation id)
  5. Another call to redis caches the result of the former request (correlated correctly - same operation id)
  6. Database call (cosmosdb) is made (dependency call correlated correctly - same operation id)
  7. Responds with result

This is a typical scenario in our service and unfortunately step 4 is causing that requests to other third-party services are not tracked within the same operation. I can find them tracked in AI at the same time, but without any correlation because the operation id is undefined.

@OsvaldoRosado
Copy link
Member

@janis91 Thanks for the extra detail!

Just for clarity on step 4:

Our service calls (request library) the authorization server and asks for authorization (the request will be sent to AI but WITHOUT operation id)

Is this item that's not correlated here the dependency recorded by the calling server, or the request recorded by the authorization server?

@janis91
Copy link

janis91 commented Jul 17, 2018

@OsvaldoRosado

Is this item that's not correlated here the dependency recorded by the calling server, or the request recorded by the authorization server?

The dependency. The authorization server is a third party, I don't have control over that one.

@OsvaldoRosado
Copy link
Member

Adding a note to this thread that #474 should address this and allow correlation to work across async/await.

@GavinOsborn
Copy link
Author

Hey @OsvaldoRosado that's great news.
Are there any cavaeats we should be aware of for this to work across async boundaries?
Un-supported node versions, etc.

@khairihafsham
Copy link

On the cls-hooked page, it mentions

When running Nodejs version < 8, this module uses AsyncWrap which is an unsupported Nodejs API, so please consider the risk before using it.
When running Nodejs version >= 8.2.1, this module uses the newer async_hooks API which is considered Experimental by Nodejs.

From my understanding, for that range of versions, user should do some verification if everything works fine.

@markwolff
Copy link
Contributor

Just adding to what @khairihafsham stated, the async API we use changes at Node 8.2.0. From looking at cls-hooked code, this seems to be the case as well (not at 8.2.1 as it is documented there). Before 8.2.0, we use async-listener (via plain cls library) instead of async_hooks (cls-hooked).

https://github.com/Microsoft/ApplicationInsights-node.js/blob/7953d1b2f2d200f333f5486705e2002784df9551/AutoCollection/CorrelationContextManager.ts#L183-L191

https://github.com/Microsoft/ApplicationInsights-node.js/blob/7953d1b2f2d200f333f5486705e2002784df9551/AutoCollection/CorrelationContextManager.ts#L140-L144

@OsvaldoRosado
Copy link
Member

@khairihafsham We only load cls-hooked for Node > 8.2, otherwise we use continuation-local-storage. This avoids the kind of trouble you quoted 😄

@GavinOsborn You'll need to be on at least Node 8.2 to get correlation tracking working across native async/await. There may also be performance penalties to the Node APIs used to monitor async/await and Promises depending on your particular workload. I'd expect newer Node the better in terms of reducing these penalties.

@GavinOsborn
Copy link
Author

Thanks for the update @OsvaldoRosado @markwolff.

Is it fair to say that async/await support is still use at your own risk at this stage?

@markwolff
Copy link
Contributor

@GavinOsborn It has been released to stable now (1.1.0). I'm pretty confident it will be more reliable than our previous zone.js implementation.

@OsvaldoRosado
Copy link
Member

@GavinOsborn as far as "use at your own risk" is concerned, I wouldn't expect anything like application crashes. (and as @markwolff says, it's certainly a lot more stable than our zone implementation was in this regard 😄) The underlying Node APIs are indeed non-final, but that's more of a problem of our SDK updating to support newer versions of Node when/if the APIs change.

The main thing to watch is if your particular workload gets a noticably higher performance overhead from the underlying V8 async/await monitoring that upgrading Node versions doesn't mitigate.

@markwolff
Copy link
Contributor

Closing this since it has been implemented and can be considered production ready at this point

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants