Outbound requests for some downstream services consistently fail after the proxy reported "panicked at 'cancel sender lost'" #8666
Related to #6086 and tower-rs/tower#415.
Hi! A few updates:
Other information that might be relevant:
For anyone who stumbles on this issue, here is the code we are using inside a cronjob (that runs every minute) to restart affected pods:
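A minimal sketch of what such a cleanup job could look like, assuming kubectl access from the cronjob and a placeholder namespace; this is an illustrative reconstruction, not the exact script from this comment:

#!/bin/bash
# Delete meshed pods whose linkerd-proxy has logged the "cancel sender lost"
# panic, so that their controllers recreate them in a clean state.
set -euo pipefail

NAMESPACE="my-namespace"   # placeholder: namespace of the affected workloads

for pod in $(kubectl -n "$NAMESPACE" get pods -o name); do
  # Inspect only the linkerd-proxy container; pods without one are skipped.
  if kubectl -n "$NAMESPACE" logs "$pod" -c linkerd-proxy --tail=10000 2>/dev/null \
      | grep -q "panicked at 'cancel sender lost'"; then
    echo "Restarting $pod (proxy panicked)"
    kubectl -n "$NAMESPACE" delete "$pod"
  fi
done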
I'm experiencing the exact same issue with both stable-2.11.2 and the latest edge-22.6.1 on Azure AKS. All of the affected applications are Java, running HTTP/1. Protocol detection seems to be okay. I tried adding the container port to the opaque ports list through an annotation, but this didn't appear to have much effect. I also tried enabling trace logs, but the output was huge. If anyone can suggest a filter that can be applied to better target the trace logs, I can give that a try.
Hi @Aleksei-Poliakov and @johnswarbrick, out of interest, do you know what instance types and AMIs your EKS worker nodes are? Thanks!
Hi @hawkw. We are using Azure AKS. Standard_D8ads_v5 running AKSUbuntu-1804gen2containerd-2022.04.13.
@johnswarbrick thanks — I must have misread your original post, I thought you were both on EKS. My bad!
Attached are logfiles at debug level.
@johnswarbrick I noticed (as described in #8677) that your proxy is logging service discovery updates pretty aggressively. While these shouldn't cause the problems you're seeing, I'd like to understand more about what's causing them. Is there anything interesting in the output of …?
@johnswarbrick The original problem happened on AMI 1.19.6-20210526 (this is what's used in production) and I have been running my experiments on 1.19.15-20211206. I will do another round of tests on the AMI that matches the version in production and report back.
Hi @Aleksei-Poliakov, @johnswarbrick, and others, I've just published a debug build of the proxy with some additional debug logging for these issues. We'd love it if you could help us out by testing out this image and sending us the logs from any proxy instances that hit this crash. When deploying this image, make sure to set the proxy log level to something that includes tower::ready_cache=trace. You can use this proxy image by adding these annotations to a workload or namespace:

annotations:
  config.linkerd.io/proxy-image: mycoliza/l2-proxy
  config.linkerd.io/proxy-version: ready-cache-debug.ff808285a
  config.linkerd.io/proxy-log-level: "warn,linkerd=debug,tower::ready_cache=trace"

or, you can run:

$ linkerd inject \
    --proxy-image "mycoliza/l2-proxy" \
    --proxy-version "ready-cache-debug.ff808285a" \
    --proxy-log-level "warn,linkerd=debug,tower::ready_cache=trace" \
    <OTHER ARGS>

Thanks in advance --- your help debugging this is greatly appreciated!
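For reference, linkerd inject writes the modified manifests to stdout, so a typical invocation (with app.yml as a placeholder for your workload manifest) pipes the result into kubectl:

$ linkerd inject \
    --proxy-image "mycoliza/l2-proxy" \
    --proxy-version "ready-cache-debug.ff808285a" \
    --proxy-log-level "warn,linkerd=debug,tower::ready_cache=trace" \
    app.yml | kubectl apply -f -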
We have isolated a similar repro and fixed it in tower-rs/tower#668 and linkerd/linkerd2-proxy#1753. The fix can be tested with:

annotations:
  config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy
  config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c
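Note that when these annotations are added at the namespace level they only affect pods injected after the change, so existing workloads have to be restarted to pick up the patched proxy image; for example (deployment and namespace names here are placeholders):

kubectl -n my-namespace rollout restart deploy/my-app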
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache` module, tower-rs/tower#415. The `ready_cache` module is used internally in Tower's load balancer. This bug resulted in panics in the proxy (linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the Destination service sends a very large number of service discovery updates (see linkerd/linkerd2#8677). This commit updates the proxy's dependency on `tower` to 0.4.13, to ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13
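As an illustration only (not necessarily how the commit itself was produced), a pin like this is typically bumped in a Cargo project by updating the lockfile to the exact release:

cargo update -p tower --precise 0.4.13

which rewrites Cargo.lock so the workspace builds against the patched tower release.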
Hello @olix0r. Thank you for the patch; I will test and report back asap! Does this patch address both the "panicked at 'cancel sender lost'" errors and the overly aggressive service discovery updates?
@johnswarbrick-napier The proxy change addresses the panics. I believe that the frequent discovery updates made the proxy more likely to trigger the panic behavior, but they shouldn't cause problems if the panic behavior is fixed. We'll probably look into the discovery updates more as we have time, especially if we're able to narrow down a repro. If you can confirm that the proxy change addresses the bug, we'll probably release a 2.11.3 with the panic fix.
I've applied the annotations above (config.linkerd.io/proxy-image: ghcr.io/olix0r/l2-proxy and config.linkerd.io/proxy-version: ver.tower-rc-fix.b296154c), and for the past hour I've not seen any occurrences of either the "panicked at 'cancel sender lost'" panic or the "buffer's worker closed unexpectedly" errors.
I will continue to monitor for 24 hours and report back, but so far this patch version is looking very promising. |
Hello @olix0r - I can confirm the patch appears to be successful! I've been running load tests for the past day and everything appears to be working and stable. If you could push a 2.11.3 release that would be fantastic, as it would be an easy upgrade for our production clusters running the current stable release. Thank you so much for your quick help and support.
This is fixed in …
Discussed in #8641
Originally posted by DarkWalker June 9, 2022
What's the issue
linkerd-proxy goes into a "broken" state, after which all requests to a specific downstream service within the mesh fail with either an INTERNAL or a CANCELLED error (as observed by the caller). Requests never make it to the target service.
Details about the setup
How the problem manifests itself
What we see in linkerd-proxy container logs of affected pods

First, there is an error:

thread 'main' panicked at 'cancel sender lost', /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/tower-0.4.12/src/ready_cache/cache.rs:415:13

This message is printed once, when the pod goes into the bad state. Then, following the error above, one of two errors appears for every outgoing request from this pod. Either:

INFO ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:35024}: linkerd_app_core::errors::respond: Request failed error=buffer's worker closed unexpectedly

or this one:

WARN ThreadId(01) outbound:server{orig_dst=10.0.0.1:80}:rescue{client.addr=172.16.0.1:49336}: linkerd_app_outbound::http::server: Unexpected error error=buffered service failed: panic.

"buffer's worker closed unexpectedly" is also present in the linkerd-proxy logs of unaffected pods which function just fine, but "cancel sender lost" only appears on pods when they go into the bad state.

Observations

Attached are logs from running

curl -v --data 'linkerd=trace,tower=trace' -X PUT localhost:28014/proxy-log-level

between two "buffer's worker closed unexpectedly" errors, which are followed by "Handling error on gRPC connection code=Internal error", which seems to be what we observe on the application side (the IP addresses as well as some bits are redacted for security purposes; please let me know if that's a problem).

linkerd_proxy_trace.txt
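As a side note, the same admin endpoint can be used to turn the verbosity back down once a capture is finished; a sketch, assuming the admin port is still reachable on 28014 and that warn,linkerd=info is the desired filter:

curl -v --data 'warn,linkerd=info' -X PUT localhost:28014/proxy-log-level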
I would appreciate any help in debugging this issue, or tips on how to work around or reproduce this in an isolated environment.