Panicked at 'missing cancelation' in tower-ready-cache #415
I can replicate this pretty reliably at high load in …
@olix0r is there any particular state you'd like me to print out at the time of the panic that might aid in debugging this? I can only replicate the issue at fairly high load, so full logs will be tricky (and probably not very useful).
As an aside, I cannot replicate this easily on a local machine, even when it is heavily loaded, but it happens on like 1/3 of runs on EC2, which suggests that it is indeed a race condition triggered by scheduling noise (which EC2 has a lot of).
I'm quite stuck on this... @olix0r any suggestions for how to go about debugging this? Any idea what the root cause might be?
The hope for this was to reproduce #415 (which, sadly, it does not), but at least it adds a test for p2c!
I've been stepping through the code, and cannot understand how the panic can possibly be triggered. For that to happen, we need to have an entry in … A key …

Let's look at each of these in turn.

Case 1: A service with key …

Case 2: A service with key …

Case 3: If a service with key …

So, none of these cases allow the panic to trigger. Yet it does. So, I must be missing something. I just don't know what. My one thought was that things would go wrong if polling a …
To state ^ a bit more concisely, there doesn't seem to be a way to have multiple services in …
This is, indeed, a very weird error. I have to admit ignorance of the newer std::future APIs. I would personally want to look at the translation from futures 0.1 to see if there's anything that changed semantically. Or perhaps there's a notification race that we haven't seen before. I'll refresh myself and see if I can come up with a better theory.
The translation seems correct to me -- it doesn't really change the logic anywhere, and the move to pin is pretty much irrelevant for … The one difference I see: the old code had this (tower/tower-ready-cache/src/cache.rs, lines 354 to 355 in 0d2a377):

…

whereas the new one has this (tower/tower-ready-cache/src/cache.rs, line 366 in e2f1a49):

…
So if the oneshot sender is dropped, the old code would panic, whereas the new code would continue. Given that I wasn't seeing panics before though, I doubt this change is the culprit (although it probably should be changed to panic again). It could be that the 0.2 scheduler has more weird interleavings or notification patterns, though given that all the methods on …
@jonhoo Your descriptions above match my understanding exactly. Thinking outside the box a bit... there are a few assumptions we could violate to produce this result, however:

…
I've looked at the `FuturesUnordered` impl and it doesn't seem to do any sort of buffering of results, so that seems unlikely. Could it be possible that `tokio::sync::oneshot` does not reliably notify receivers under load?
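A minimal sketch of the guarantee being asked about here, using only the public `tokio::sync::oneshot` API. Note that this only exercises eventual delivery (the `await` completing); it says nothing about a single `poll` transiently returning `Pending`, which is where the thread eventually lands.

```rust
use tokio::sync::oneshot;

#[tokio::test]
async fn send_is_observed_by_receiver() {
    let (tx, rx) = oneshot::channel::<u32>();
    // `send` succeeds as long as the receiver is still alive.
    tx.send(42).expect("receiver alive");
    // If notifications could be lost under load, this await could hang;
    // completing with the sent value is the guarantee in question.
    assert_eq!(rx.await.expect("sender sent a value"), 42);
}
```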
Yeah, I have looked at the …

Though that execution seems insane. It would mean that writing to a field through …
As an aside, this loom test passes, which gives me some confidence that operating on local state across a `oneshot` receive behaves as expected:

```rust
#[test]
fn future_poll_happens_before() {
    use crate::sync::oneshot;

    loom::model(|| {
        let mut pool = mk_pool(2);
        let (tx, rx) = oneshot::channel();
        let jh = pool.spawn(async move {
            // Write before the await; the assertion below checks that the
            // write is still observed after the task is woken (possibly on
            // another worker thread).
            let mut x = 0;
            x = 1;
            rx.await.unwrap();
            assert_eq!(x, 1);
        });
        tx.send(()).unwrap();
        pool.block_on(jh).unwrap();
    });
}
```

Also, I filed #418 to restore the panic.
Given that I can reproduce the error on EC2 under load, are there any assertions or other code snippets that it would be useful for me to inject and then run where it fails?
Also, I don't love this option, but these … But, we are somehow violating an invariant, so it would be good to get to the bottom of this. If I were debugging this, I'd probably add a …
Is it possible that we are somehow mismatching cancelation pairs? I.e., that the … (see tower/tower-ready-cache/src/cache.rs, lines 234 to 239 in 47c3a14):

…
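For readers following along, here is a minimal sketch of the pairing being questioned, with simplified, hypothetical names (the real code lives in `cache.rs` at the lines referenced above): each pending service gets a fresh cancelation channel, with the sender stored by key and the receiver carried by the pending entry.

```rust
use std::collections::HashMap;
use std::hash::Hash;

use tokio::sync::oneshot;

// Hypothetical, simplified stand-ins for the cache's internals.
struct PendingEntry<K> {
    key: K,
    cancel_rx: oneshot::Receiver<()>,
}

fn push_pending<K: Clone + Eq + Hash>(
    cancel_txs: &mut HashMap<K, oneshot::Sender<()>>,
    pending: &mut Vec<PendingEntry<K>>,
    key: K,
) {
    let (cancel_tx, cancel_rx) = oneshot::channel();
    // Inserting under an existing key displaces the old sender; signaling
    // it cancels the pending service it belonged to. A "mismatched pair"
    // would mean the tx stored here ends up associated with a different
    // rx than the one carried by this entry -- which is what the comment
    // above is asking about.
    if let Some(old_tx) = cancel_txs.insert(key.clone(), cancel_tx) {
        let _ = old_tx.send(());
    }
    pending.push(PendingEntry { key, cancel_rx });
}
```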
I can't see how it would be possible... but it would be helpful to see the ordering of cancelation and `poll_pending` results immediately before a panic.
Yeah, I also don't see how that could happen, because the service that is in …
I don't think those …
If you have a patch in mind I'd be happy to run with it applied. It might generate some pretty large logs though, especially because there are multiple …
Here's a quick attempt to fill in some gaps: https://github.com/olix0r/tower/pull/new/ver/ready-cache-debug

Note that, if you are using …
I'll run with that once I get into the office tomorrow, thanks! It's, err, tricky to set spans appropriately because the …
Interestingly, I found that I was able to reproduce the issue locally too if I just load the CPU enough (…).
I modified your log statements slightly to also include the failing key.
Here's the log for just the failing key (key 1): https://jon.tsp.io/share/tower-ready-cache-panic-key.log
I'm seeing a lot of:

…

which seems surprising.
Ah, I think that one trace line is just placed incorrectly (it should be in an …).
Fixed (and updated the links above).
Here are the last few events for the failing key:

…

Where … All the other evictions look like we'd expect:

…
As a sanity check, I also tried storing a copy of the service key inside … I also had a careful read through the code for … This in turn led me to wonder whether …
And here's another: even polling the …
Another data-point: adding a …
And here's another: if I replace the `oneshot` with an `mpsc::channel(1)`, the bug still occurs.
It's been observed that occasionally tower-ready-cache would panic trying to find an already canceled service in `cancel_pending_txs` (#415). The source of the race is not entirely clear, but extensive debugging demonstrated that occasionally a call to `evict` would send on the `CancelTx` for a service, yet that service would be yielded back from `pending` in `poll_pending` in a non-`Canceled` state. This is equivalent to saying that this code may panic:

```rust
async {
    let (tx, rx) = oneshot::channel();
    tx.send(42).unwrap();
    yield_once().await;
    rx.try_recv().unwrap(); // <- may occasionally panic
}
```

I have not been able to demonstrate a self-contained example failing in this way, but it's the only explanation I have found for the observed bug. Pinning the entire runtime to one core still produced the bug, which indicates that it is not a memory ordering issue. Replacing oneshot with `mpsc::channel(1)` still produced the bug, which indicates that the bug is not with the implementation of `oneshot`. Logs also indicate that the `ChannelTx` we send on in `evict()` truly is the same one associated with the `ChannelRx` polled in `Pending::poll`, so we're not getting our wires crossed somewhere. It truly is bizarre.

This patch resolves the issue by considering a failure to find a ready/errored service's `CancelTx` as another signal that a service has been removed. Specifically, if `poll_pending` finds a service that returns `Ok` or `Err`, but does _not_ find its `CancelTx`, then it assumes that it must be because the service _was_ canceled, but did not observe that cancellation signal.

As an explanation, this isn't entirely satisfactory, since we do not fully understand the underlying problem. It _may_ be that a canceled service could remain in the pending state for a very long time if it does not become ready _and_ does not see the cancellation signal (so it returns `Poll::Pending` and is not removed). That, in turn, might cause an issue if the driver of the `ReadyCache` then chooses to re-use a key they believe they have evicted. However, any such case _must_ first hit the panic that exists in the code today, so this is still an improvement over the status quo.

Fixes #415.
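The shape of the workaround described in the message above, as a sketch with hypothetical, simplified types (not the actual patch): a completed pending service whose `CancelTx` is missing is treated as evicted rather than triggering the `expect`.

```rust
use std::collections::HashMap;
use std::hash::Hash;

// Hypothetical outcome type for illustration only.
enum Completion {
    BecameReady,
    WasCanceled,
}

fn on_pending_complete<K: Eq + Hash, Tx>(
    cancel_pending_txs: &mut HashMap<K, Tx>,
    key: &K,
) -> Completion {
    match cancel_pending_txs.remove(key) {
        // Normal path: the service was still tracked, so promote it.
        Some(_cancel_tx) => Completion::BecameReady,
        // Previously `.expect("missing cancelation")` -- the panic from
        // #415. Now interpreted as: the service *was* canceled but never
        // observed the signal, so treat it as removed.
        None => Completion::WasCanceled,
    }
}
```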
linkerd/linkerd2#6086 describes an issue that sounds closely related to tower-rs#415: there's some sort of consistency issue between the ready-cache's pending stream and its set of cancelations. Where the latter issue describes triggering a panic in the stream receiver, the former describes triggering a panic in the stream implementation.

There's no logical reason why we can't continue to operate in this scenario, though it does indicate a real correctness issue. So, this change prevents panicking in this scenario when not building with debugging. Instead, we now emit WARN-level logs so that we have a clearer signal that they're occurring.

Finally, this change also adds `Debug` constraints to the cache's key types (and hence the balancer's key types) so that we can more reasonably debug this behavior.
Signed-off-by: Eliza Weisman <[email protected]>
`tokio::task` enforces a cooperative scheduling regime that can cause `oneshot::Receiver::poll` to return pending after the sender has sent an update. `ReadyCache` uses a oneshot to notify pending services that they should not become ready. When a cancelation is not observed, the ready cache returns service instances that should have been canceled, which breaks assumptions and causes an invalid state.

This branch replaces the use of `tokio::sync::oneshot` for canceling pending futures with a custom cancelation handle using an `AtomicBool` and `futures::task::AtomicWaker`. This ensures that canceled `Pending` services are always woken, even when the task's budget is exceeded. Additionally, cancelation status is now always known to the `Pending` future, by checking the `AtomicBool` immediately on polls, even in cases where the canceled `Pending` future was woken by the inner `Service` becoming ready rather than by the cancelation.

Fixes #415

Signed-off-by: Oliver Gould <[email protected]>
Co-authored-by: Eliza Weisman <[email protected]>
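A minimal sketch of the cancelation handle the message above describes, under its stated ingredients (an `AtomicBool` plus `futures::task::AtomicWaker`); the names here are hypothetical, not tower's actual types. The important properties are that `cancel()` stores the flag before waking, and that the future checks the flag on every poll, so cancelation is observed even when the wakeup coincides with the inner service becoming ready.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll};

use futures::task::AtomicWaker;

struct Cancel {
    canceled: AtomicBool,
    waker: AtomicWaker,
}

impl Cancel {
    fn new() -> Self {
        Cancel {
            canceled: AtomicBool::new(false),
            waker: AtomicWaker::new(),
        }
    }

    fn cancel(&self) {
        // Store the flag *before* waking, so any poll that runs after the
        // wake is guaranteed to observe it.
        self.canceled.store(true, Ordering::SeqCst);
        self.waker.wake();
    }
}

// A future that resolves once `cancel()` has been called.
struct WaitForCancel {
    cancel: Arc<Cancel>,
}

impl Future for WaitForCancel {
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        // Check the flag immediately, no matter why we were woken -- this
        // is what guarantees a canceled service can't be missed.
        if self.cancel.canceled.load(Ordering::SeqCst) {
            return Poll::Ready(());
        }
        self.cancel.waker.register(cx.waker());
        // Re-check after registering, to close the race with a concurrent
        // `cancel()` that ran between the first check and `register`.
        if self.cancel.canceled.load(Ordering::SeqCst) {
            Poll::Ready(())
        } else {
            Poll::Pending
        }
    }
}
```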
# 0.4.13 (June 17, 2022)

### Added

- **load_shed**: Public constructor for `Overloaded` error ([#661])

### Fixed

- **util**: Fix hang with `call_all` when the `Stream` of requests is pending ([#656])
- **ready_cache**: Ensure cancelation is observed by pending services ([#668], fixes [#415])
- **docs**: Fix a missing section header due to a typo ([#646])
- **docs**: Fix broken links to `Service` trait ([#659])

[#661]: #661
[#656]: #656
[#668]: #668
[#415]: #415
[#646]: #646
[#659]: #659
Tower [v0.4.13] includes a fix for a bug in the `tower::ready_cache` module, tower-rs/tower#415. The `ready_cache` module is used internally in Tower's load balancer. This bug resulted in panics in the proxy (linkerd/linkerd2#8666, linkerd/linkerd2#6086) in cases where the Destination service sends a very large number of service discovery updates (see linkerd/linkerd2#8677).

This commit updates the proxy's dependency on `tower` to 0.4.13, to ensure that this bugfix is picked up.

Fixes linkerd/linkerd2#8666
Fixes linkerd/linkerd2#6086

[v0.4.13]: https://github.com/tower-rs/tower/releases/tag/tower-0.4.13
I just had my code hit this `.expect` in `tower-ready-cache`. It's using `tower-balance::pool` (and thus also `tower-balance::p2c`). Not sure what caused it, but figured maybe @olix0r can tease it out from the backtrace?