Outbound requests for some downstream services consistently fail after the proxy reported "panicked at 'cancel sender lost'" #8666
Related to #6086 and tower-rs/tower#415.
Hi! A few updates:
Other information that might be relevant:
For anyone who stumbles on this issue, here is the code we are using inside a cronjob (that runs every minute) to restart affected pods:
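A minimal sketch of what such a cleanup job could look like, assuming kubectl access from the cronjob and a placeholder namespace; this is an illustrative reconstruction, not the exact script from this comment:

#!/bin/bash
# Delete meshed pods whose linkerd-proxy has logged the "cancel sender lost"
# panic, so that their controllers recreate them in a clean state.
set -euo pipefail

NAMESPACE="my-namespace"   # placeholder: namespace of the affected workloads

for pod in $(kubectl -n "$NAMESPACE" get pods -o name); do
  # Inspect only the linkerd-proxy container; pods without one are skipped.
  if kubectl -n "$NAMESPACE" logs "$pod" -c linkerd-proxy --tail=10000 2>/dev/null \
      | grep -q "panicked at 'cancel sender lost'"; then
    echo "Restarting $pod (proxy panicked)"
    kubectl -n "$NAMESPACE" delete "$pod"
  fi
done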
I'm experiencing the exact same issue with both stable-2.11.2 and the latest edge-22.6.1 on Azure AKS. All of the affected applications are Java, running HTTP/1. Protocol detection seems to be okay. I tried adding the container port to the opaque ports list through an annotation, but this didn't appear to have much effect. I also tried enabling trace logs, but the output was huge. If anyone can suggest a filter that can be applied to better target the trace logs, I can give that a try.
Hi @Aleksei-Poliakov and @johnswarbrick, out of interest, do you know what instance types and AMIs your EKS worker nodes are? Thanks!
Hi @hawkw. We are using Azure AKS. Standard_D8ads_v5 running AKSUbuntu-1804gen2containerd-2022.04.13.
@johnswarbrick thanks — I must have misread your original post, I thought you were both on EKS. My bad!
Attached are logfiles at debug level.
@johnswarbrick I noticed (as described in #8677) that your proxy is logging service discovery updates pretty aggressively. While these shouldn't cause the problems you're seeing, I'd like to understand more about what's causing them. Is there anything interesting in the output of …?
@johnswarbrick The original problem happened on AMI 1.19.6-20210526 (this is what's used in production) and I have been running my experiments on 1.19.15-20211206. I will do another round of tests on the AMI that matches the version in production and report back.
Hi @Aleksei-Poliakov, @johnswarbrick, and others, I've just published a debug build of the proxy with some additional debug logging for these issues. We'd love it if you could help us out by testing out this image and sending us the logs from any proxy instances that hit this crash. When deploying this image, make sure to set the proxy log level to something that includes tower::ready_cache=trace. You can use this proxy image by adding these annotations to a workload or namespace:

annotations:
  config.linkerd.io/proxy-image: mycoliza/l2-proxy
  config.linkerd.io/proxy-version: ready-cache-debug.ff808285a
  config.linkerd.io/proxy-log-level: "warn,linkerd=debug,tower::ready_cache=trace"

or, you can run:

$ linkerd inject \
    --proxy-image "mycoliza/l2-proxy" \
    --proxy-version "ready-cache-debug.ff808285a" \
    --proxy-log-level "warn,linkerd=debug,tower::ready_cache=trace" \
    <OTHER ARGS>

Thanks in advance --- your help debugging this is greatly appreciated!
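For reference, linkerd inject writes the modified manifests to stdout, so a typical invocation (with app.yml as a placeholder for your workload manifest) pipes the result into kubectl:

$ linkerd inject \
    --proxy-image "mycoliza/l2-proxy" \
    --proxy-version "ready-cache-debug.ff808285a" \
    --proxy-log-level "warn,linkerd=debug,tower::ready_cache=trace" \
    app.yml | kubectl apply -f -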
We have isolated a similar repro and fixed it in tower-rs/tower#668 and linkerd/linkerd2-proxy#1753. The fix can be tested with:

annotations:
  config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
  config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c
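Note that when these annotations are added at the namespace level they only affect pods injected after the change, so existing workloads have to be restarted to pick up the patched proxy image; for example (deployment and namespace names here are placeholders):

kubectl -n my-namespace rollout restart deploy/my-app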
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache` module, tower-rs/tower#415. The `ready_cache` module is used internally in Tower's load balancer. This bug resulted in panics in the proxy (linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the Destination service sends a very large number of service discovery updates (see linkerd/linkerd2#8677). This commit updates the proxy's dependency on `tower` to 0.4.13, to ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13
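As an illustration only (not necessarily how the commit itself was produced), a pin like this is typically bumped in a Cargo project by updating the lockfile to the exact release:

cargo update -p tower --precise 0.4.13

which rewrites Cargo.lock so the workspace builds against the patched tower release.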
Hello @olix0r. Thank you for the patch; I will test and report back asap! Does this patch address both the "panicked at 'cancel sender lost'" errors and the overly aggressive service discovery updates?
@johnswarbrick-napier The proxy change addresses the panics. I believe that the frequent discovery updates made the proxy more likely to trigger the panic behavior, but they shouldn't cause problems if the panic behavior is fixed. We'll probably look into the discovery updates more as we have time, especially if we're able to narrow down a repro. If you can confirm that the proxy change addresses the bug, we'll probably release a 2.11.3 with the panic fix.
I've applied the annotations above (config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy and config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c), and for the past hour I've not seen any occurrences of either the "panicked at 'cancel sender lost'" panic or the "buffer's worker closed unexpectedly" errors.
I will continue to monitor for 24 hours and report back, but so far this patch version is looking very promising. |
Hello @olix0r - I can confirm the patch appears to be successful! I've been running load tests for the past day and everything appears to be working and stable. If you could push a 2.11.3 release that would be fantastic, as it would be an easy upgrade for our production clusters running the current stable release. Thank you so much for your quick help and support.
This is fixed in …
Discussed in #8641
Originally posted by DarkWalker June 9, 2022
What's the issue
linkerd-proxy goes into a "broken" state, after which all requests to a specific downstream service within the mesh fail with either an INTERNAL or a CANCELLED error (as observed by the caller). Requests never make it to the target service.
Details about the setup
How the problem manifests itself
What we see in linkerd-proxy container logs of affected pods

First, there is an error:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.12/src/ready_cache/cache.rs:415:13

This message is printed once, when the pod goes into the bad state. Then, following the error above, one of two errors appears for every outgoing request from this pod. Either:

INFO ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:35024}: linkerd_app_core::errors::respond: Request failed error=buffer's worker closed unexpectedly

or this one:

WARN ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:49336}: linkerd_app_outbound::http::server: Unexpected error error=buffered service failed: panic.

"buffer's worker closed unexpectedly" is also present in the linkerd-proxy logs of unaffected pods which function just fine, but "cancel sender lost" only appears on pods when they go into the bad state.

Observations

Attached are logs from running

curl -v --data 'linkerd=trace,tower=trace' -X PUT localhost:28014/proxy-log-level

between two "buffer's worker closed unexpectedly" errors, which are followed by "Handling error on gRPC connection code=Internal error", which seems to be what we observe on the application side (the IP addresses as well as some bits are redacted for security purposes; please let me know if that's a problem).

linkerd_proxy_trace.txt
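As a side note, the same admin endpoint can be used to turn the verbosity back down once a capture is finished; a sketch, assuming the admin port is still reachable on 28014 and that warn,linkerd=info is the desired filter:

curl -v --data 'warn,linkerd=info' -X PUT localhost:28014/proxy-log-level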
I would appreciate any help in debugging this issue, or tips on how to work around or reproduce this in an isolated environment.