Outbound requests fail intermittently after the proxy reported "panicked at 'cancel sender lost'" #6086
Comments
Thanks for letting us know. I've deleted the v2.142.0 proxy release -- it hit some other issues during integration tests and isn't yet ready for public consumption. In general, I'd only recommend using proxy versions that have shipped in an edge release. Is there a specific reason you picked up v2.142.0? Are you using a patched version of the proxy? For what it's worth, we plan on releasing a v2.10.2 release that uses proxy version v2.141.1.
Hey @olix0r, thanks for the reply. Just wanted to point out that the issue existed before we upgraded the proxy to v2.142.0 -- we were using v2.139 when we saw the first occurrence. I will update the issue description.
@Wenliang-CHEN We've been trying to reproduce this in library tests but haven't been able to get a solid lead on what's going on. Next week, we'll put together a branch that increases diagnostic logging in the ready-cache and balancer and ask you to test that out, if that works for you.
Hey @olix0r, thanks for the update. We will upgrade to it once it is ready. Also, an update from our side: we are not able to reproduce the issue either, but when it happens we observe a high request rate for outbound traffic during the incident.
Hi @Wenliang-CHEN, I've published a linkerd proxy image with additional diagnostic logging. If you could run it and capture logs around a failure, that would be extremely helpful. Thanks!
Specifically, you'll want to set the proxy image, proxy version, and proxy log-level workload annotations on the pod template.
Hello again. We are able to reproduce the issue with the debug proxy. We observed the following pattern during the incident.

First we saw a large number of logs like the following:

DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}:logical{dst=service-name.prod.svc.cluster.local:80}:concrete{addr=service-name-primary.prod.svc.cluster.local:80}: tower::ready_cache::cache: endpoint canceled

Then we saw the panic log:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.7/src/ready_cache/cache.rs:397:13

Then the following:

[25455.324175s] WARN ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Failed to proxy request: buffered service failed: panic client.addr=10.250.187.39:35960
[25455.324282s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}: linkerd_app_core::serve: Connection closed
[25455.324205s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Handling error with HTTP response status=502 Bad Gateway version=HTTP/1.1
[25455.324199s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Closing server-side connection
[25455.324251s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: The stack is tearing down the connection
[25455.318532s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:35960}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}:logical{dst=service-name.prod.svc.cluster.local:80}:concrete{addr=service-name-primary.prod.svc.cluster.local:80}: tower::ready_cache::cache: endpoint canceled

Afterwards, the full connection lifecycle looks like this:

[25455.329021s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Handling error with HTTP response status=502 Bad Gateway version=HTTP/1.1
[25455.329114s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}: linkerd_app_core::serve: Connection closed
[25455.329072s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: The stack is tearing down the connection
[25455.328793s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile: linkerd_detect: DetectResult protocol=Some(Http1) elapsed=17.323µs
[25455.328895s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: Handling as HTTP version=Http1
[25455.329003s] WARN ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Failed to proxy request: buffered service failed: buffered service failed: panic client.addr=10.250.187.39:36226
[25455.329015s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_app_core::errors: Closing server-side connection
[25455.328821s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}: linkerd_proxy_http::server: Creating HTTP service version=Http1
[25455.328971s] DEBUG ThreadId(01) outbound:accept{client.addr=10.250.187.39:36226}:server{orig_dst=172.20.207.194:80}:profile:http{v=1.x}:logical{dst=service-name.prod.svc.cluster.local:80}: linkerd_service_profiles::http::route_request: Updating HTTP routes routes=0

What's worth mentioning:
* The destination is also running with a Linkerd proxy - nginx - fpm setup.
* The target endpoint at the destination is slow; it reaches its execution timeout at 60s, which I think is where the "502 Bad Gateway" comes from.
* We did not paste the "keep alive" logs here as we think those are irrelevant. Please let us know if you need them too.

Thanks!
Thank you for the helpful debug info!
Hi @Wenliang-CHEN, is it possible to get more complete logs starting from before the first "endpoint canceled" message was logged? Thank you!
Hey @hawkw, sure, please find the logs attached. You will find the first occurrence of "endpoint canceled" at line 12, and the "cancel sender lost" panic at line 185.
This branch updates the `futures` crate to v0.3.15. This includes a fix for task starvation with `FuturesUnordered` (added in 0.3.13). This may or may not be related to issues that have been reported in the proxy involving the load balancer (linkerd/linkerd2#6086), but we should update to the fixed version regardless. This may also improve performance in some cases, since we may now have to do fewer poll-wakeup cycles when a load balancer has a large number of pending endpoints.
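As a toy illustration (not the balancer's actual code) of what a large set of pending endpoints in a `FuturesUnordered` looks like -- the endpoint futures and counts here are made up, and the example assumes the `futures` and `tokio` crates:

```rust
use futures::stream::{FuturesUnordered, StreamExt};

#[tokio::main]
async fn main() {
    // A balancer keeps its not-yet-ready endpoints in a FuturesUnordered-style
    // set; with many pending entries, older `futures` versions could spend a
    // long time inside a single poll, which is the starvation issue that the
    // v0.3.13 fix addresses.
    let mut pending: FuturesUnordered<_> = (0..1_000)
        .map(|i| async move {
            // Stand-in for "waiting for endpoint i to become ready".
            tokio::task::yield_now().await;
            i
        })
        .collect();

    let mut ready = 0usize;
    while let Some(_endpoint) = pending.next().await {
        ready += 1;
    }
    println!("{ready} endpoints became ready");
}
```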
We've been stress testing the tower ready-cache dependency and have been able to trigger some unexpected behavior, though not the exact problems you seem to have captured. There's at least one fix we picked up from futures (rust-lang/futures-rs#2333) that should eliminate some pathological behavior when there are many endpoints in a balancer. In your case, roughly how many endpoints should exist in the target service? Are endpoints churning (being deleted/created) frequently, or are they relatively static?
Thanks for the update @olix0r. The target service is a monolith that has around 200 endpoints. I would say 90% of them are static. And the service that makes the outbound call uses 8 of them, all static.
@Wenliang-CHEN Sorry, I should have been clearer: how many pods of the service are running? Are these panics at all correlated with deployments/restarts of the target service?
@olix0r there are 6 pods running. We did observe a coincidence between deployments of the target service and the panics, but it does not always happen. When a panic happens, it does not affect all the pods -- mostly just 1 or 2.
@Wenliang-CHEN Thanks, this is helpful. I doubt that the futures change will help this issue. I suspect that there's a race condition around updating the balancer with new endpoints where we enter an illegal state. We'll focus more on stress testing the update path.
linkerd/linkerd2#6086 describes an issue that sounds closely related to tower-rs#415: there's some sort of consistency issue between the ready-cache's pending stream and its set of cancelations. Where the latter issue describes triggering a panic in the stream receiver, the former describes triggering a panic in the stream implementation. There's no logical reason why we can't continue to operate in this scenario, though it does indicate a real correctness issue. So, this change prevents panicking in this scenario when not building with debugging; instead, we now emit WARN-level logs so that we have a clearer signal that these states are occurring. Finally, this change also adds `Debug` constraints to the cache's key types (and hence the balancer's key types) so that we can more reasonably debug this behavior.
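To make the "warn instead of panic" part of that change concrete, here is a minimal Rust sketch of the pattern it describes. This is illustrative only -- the struct, field, and method names are stand-ins, not tower's actual source:

```rust
use std::collections::HashMap;
use std::fmt::Debug;
use std::hash::Hash;

use tokio::sync::oneshot;
use tracing::warn;

/// Illustrative stand-in for the ready-cache's per-endpoint cancelation handles.
/// The `Debug` bound mirrors the change's new constraint on key types.
struct Cancelations<K: Hash + Eq + Debug> {
    senders: HashMap<K, oneshot::Sender<()>>,
}

impl<K: Hash + Eq + Debug> Cancelations<K> {
    /// Signal cancelation for `key`. A missing sender used to be treated as an
    /// unreachable state and panicked ("cancel sender lost"); with the change
    /// described above, the panic only fires under debug assertions, and
    /// release builds just log a WARN and keep serving.
    fn cancel(&mut self, key: &K) {
        match self.senders.remove(key) {
            Some(tx) => {
                // The receiver may already be gone; ignore the send error.
                let _ = tx.send(());
            }
            None => {
                debug_assert!(false, "cancel sender lost for {:?}", key);
                warn!(?key, "cancel sender lost");
            }
        }
    }
}
```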
We've had a lot of trouble reproducing this in tests, but I think this is very likely a manifestation of the same problem described in tower-rs/tower#415. I'm especially suspicious of the ready-cache here, and I've put together a proxy build with extra debug instrumentation.
I recommend setting the following annotations on your pod template:

config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
config.linkerd.io/proxy-version: tower-ready-debug.ab6c68ee
config.linkerd.io/proxy-log-level: linkerd=info,tower::ready_cache=debug,warn

Your application should no longer panic. If you see WARN-level logs, it would be great if we could capture the preceding logs to get a better sense of the access pattern that may be triggering this.
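For anyone following along, this is roughly where those annotations sit in a workload manifest; a minimal sketch assuming a Deployment named `my-app` in the `default` namespace (both placeholders). The annotations go on the pod template's metadata:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app        # placeholder workload name
  namespace: default  # placeholder namespace
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
      annotations:
        # Debug proxy image and log level from the comment above.
        config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
        config.linkerd.io/proxy-version: tower-ready-debug.ab6c68ee
        config.linkerd.io/proxy-log-level: linkerd=info,tower::ready_cache=debug,warn
    spec:
      containers:
        - name: app
          image: my-app:latest  # placeholder application image
```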
Hey @olix0r, thanks for the update. I will get the service running with the new proxy and will report here once I see anything interesting.
Hello @olix0r, we let the service run for 2 days with the new proxy. The panic does not happen anymore, and we do notice some WARN messages in the logs. Please find the samples attached, and let me know if anything is missing or incomplete.
Thanks! Interestingly, it looks like we haven't triggered the scenario we've seen previously: in those logs most of the warnings appear to come from the reconnect module (which is more-or-less expected if the target endpoint isn't available), but none of those warnings should really impact the balancer/ready_cache. The logs I'm most interested in catching are the "Pending service lost" and "Ready service had no associated cancelation" warnings. If you want to filter out the reconnect log messages, you could run with a log level that turns down the reconnect module's target (see the sketch below).
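As an aside, such a log-level value would look something like the following. This is only a sketch: the exact tracing target for the proxy's reconnect module isn't given in this thread, so `<reconnect_target>` below is a placeholder, not a real target name.

```yaml
# Sketch only: quiet the reconnect module while keeping the ready_cache debug
# logs the maintainers asked for. Replace <reconnect_target> with the reconnect
# module's actual log target.
config.linkerd.io/proxy-log-level: linkerd=info,<reconnect_target>=error,tower::ready_cache=debug,warn
```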
Cool, I will keep it running. Maybe we just need a bit more time until the scenario gets triggered.
@Wenliang-CHEN have you seen any of these warnings over the past week?
Hello @olix0r, we have been running the service with the new proxy and the same log level for a week. There are no logs related to "Pending service lost..." or "Ready service had no associated cancelation", and we observe no connection issues from that service any more. Is it possible that the library upgrade somehow fixes or suppresses the issue?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache` module, tower-rs/tower#415. The `ready_cache` module is used internally in Tower's load balancer. This bug resulted in panics in the proxy (linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the Destination service sends a very large number of service discovery updates (see linkerd/linkerd2#8677). This commit updates the proxy's dependency on `tower` to 0.4.13 to ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13
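For downstream users, picking up that fix is just a dependency bump; a minimal Cargo.toml sketch (assuming `tower` is a direct dependency -- the feature list is illustrative, not the proxy's actual manifest):

```toml
[dependencies]
# tower 0.4.13 includes the ready_cache fix from tower-rs/tower#415.
tower = { version = "0.4.13", features = ["balance"] }
```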
Bug Report
What is the issue?
The outbound requests of a meshed pod fail intermittently after its linkerd-proxy reported "panicked at 'cancel sender lost'".
We are not sure what triggers the issue. From the logs we can tell that the failures start after the linkerd-proxy emits the "cancel sender lost" panic from tower's ready_cache module (the full log lines are quoted in the comments above).
Then around 50% of the outbound requests start failing intermittently with 502s and "Failed to proxy request: buffered service failed: panic" warnings in the proxy logs.
Additional context
* The outbound destination is also a meshed service.
* The linkerd-init container exited with "Completed" status in the pod.
* Before and during the incident, there were no restarts of either the application container or the proxy container.
* Once we restarted the pod manually, the outbound traffic succeeded at 100% again.
linkerd check output
Environment
* Control plane: v2.10.0
* linkerd-proxy: the issue occurred with both v2.139 and v2.142
* linkerd-init: cr.l5d.io/linkerd/proxy-init:v1.3.9