SorenHolstHansen changed the title from "reqwest seems to hand around after instrumented function is done" to "reqwest seems to hang around after instrumented function is done" on Sep 26, 2024
I believe stuff like this happens when something holds onto a span longer than expected. I know this happened to me with h2 and I think it could be a cause here since reqwest pools connections.
Basically the connection has a span that is a child of your endpoint span and thus keeps it alive.
You can try doing something like let _guard = info_span!(parent: None, "dummy").entered(); just before doing the reqwest call (and maybe dropping the guard right after). That should fix the endpoint span, but you'll no longer see the reqwest span in your trace.
Otherwise just filtering reqwest out should also fix your trace length.
I don't think this can really be fixed in tracing. Tracing itself can only give advice on whether libraries should or should not hold onto spans like this, and better convey the nuances and what the user will see.
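To make the mechanism above concrete, here is a minimal std-only sketch. It is a toy model, not the tracing or reqwest API: Span, PooledConnection and closed_at_handler_return are hypothetical names invented for illustration, and the model assumes (as described above) that a child span keeps its parent from closing. It compares a connection span created as a child of the endpoint span with one created without a parent, which is the effect of parent: None.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Shared log of span names, in the order the spans closed.
type CloseLog = Rc<RefCell<Vec<&'static str>>>;

/// Toy span data: the span "closes" (records its name) when the last handle is dropped.
struct SpanInner {
    name: &'static str,
    // A child keeps a handle to its parent, so the parent cannot close first.
    _parent: Option<Span>,
    log: CloseLog,
}

impl Drop for SpanInner {
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.name);
    }
}

/// Cheaply clonable handle to a toy span (hypothetical, not tracing::Span).
#[derive(Clone)]
struct Span(Rc<SpanInner>);

impl Span {
    fn new(name: &'static str, parent: Option<&Span>, log: &CloseLog) -> Span {
        Span(Rc::new(SpanInner {
            name,
            _parent: parent.cloned(),
            log: Rc::clone(log),
        }))
    }
}

/// Stand-in for a pooled connection that stores the span it was created in.
struct PooledConnection {
    _span: Span,
}

/// Simulates one request. Returns the spans closed when the handler returned,
/// and the spans closed after the connection pool was dropped.
fn closed_at_handler_return(detach: bool) -> (Vec<&'static str>, Vec<&'static str>) {
    let log: CloseLog = Rc::new(RefCell::new(Vec::new()));
    let mut pool: Vec<PooledConnection> = Vec::new();
    {
        let endpoint = Span::new("endpoint", None, &log);
        // detach = true models creating the span with no parent.
        let parent = if detach { None } else { Some(&endpoint) };
        let conn_span = Span::new("connection", parent, &log);
        pool.push(PooledConnection { _span: conn_span });
        // The handler returns here and its own `endpoint` handle is dropped.
    }
    let at_return = log.borrow().clone();
    drop(pool); // the pooled connection finally goes away
    let after_pool_drop = log.borrow().clone();
    (at_return, after_pool_drop)
}

fn main() {
    let (at_return, after) = closed_at_handler_return(false);
    println!("child span:    at return {:?}, after pool drop {:?}", at_return, after);
    let (at_return, after) = closed_at_handler_return(true);
    println!("detached span: at return {:?}, after pool drop {:?}", at_return, after);
}
```

In the first case the endpoint span is still open when the handler returns and only closes once the pooled connection is dropped, which matches a trace where the endpoint appears to run long after the response was sent. In the detached case the endpoint span closes as soon as the handler is done.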
Bug Report
Version
Platform
Darwin Srens-MacBook-Air.local 23.6.0 Darwin Kernel Version 23.6.0: Wed Jul 31 20:53:05 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T8112 arm64
Description
We have noticed a weird situation in the production traces for our axum server: sometimes the traces report that an endpoint took a very long time even though the endpoint returned its response very quickly. This doesn't happen all the time, but when it does, it is always related to reqwest calls (i.e. the reqwest crate).
A trace could look something like this (this is how it looks in our prod metrics platform in Azure; you will see below that another platform shows the endpoint as shorter):
That is, the response is done quite quickly, but because the reqwest call seems to hang, the whole endpoint shows as taking a long time. The reqwest calls themselves actually all finish very quickly: you can see that "stuff_after" continues shortly after, and we have used other approaches to confirm that they finish fast. So something is hanging around in the reqwest client that makes the call show as taking a long time.
We even added a field to the root request span (using tower tracing) that shows that the endpoint is done quickly. So from the user's point of view nothing is taking a long time, but for us it means we can't trust our tracing.
I made a reproducible case here: https://github.com/SorenHolstHansen/reqwest_tracing_bug
To reproduce, I started a local OpenTelemetry platform (SigNoz or Jaeger), started the server from the repo I linked, and then ran
seq 1 5000 | xargs -P0 -I{} curl http://127.0.0.1:8008
In SigNoz, you should then be able to see cases like these:
You can see in the bottom right that the endpoint took 0 seconds, but resolve took 35 seconds.
I am not sure whether this is strictly the fault of tracing, or whether something weird happens internally in reqwest or somewhere else.