Cleanup and fix cycle handling #285
If I am not mistaken, all current tests use "static" cycles -- a cycle that is always present. Let's spice that up by adding conditional cycles: cycles that appear only for specific impls.
This allows us to figure out whether a query can recover from a cycle (and how) without invoking the `recover` function.
Find the cycle recovery strategy for a given `DatabaseKey`.
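In sketch form, the idea might look like this (the names and shapes here are my assumptions for illustration, not the exact salsa internals):

```rust
/// Hypothetical sketch: cycle recovery becomes a property we can
/// compute per query, without ever calling a `recover` function.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum CycleRecoveryStrategy {
    /// No recovery: a cycle through this query panics.
    Panic,
    /// The query supplies a fallback value when it is in a cycle.
    Fallback,
}

/// Hypothetical lookup: given a database key, report how the
/// corresponding query recovers from cycles.
trait DatabaseKeyExt {
    fn cycle_recovery_strategy(&self) -> CycleRecoveryStrategy;
}
```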
The tests now use signals to guarantee we are testing the code paths we want to be testing.
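A minimal signal of this kind might look like the following sketch, assuming a plain `Mutex`/`Condvar` staging scheme (salsa's actual test helper may differ):

```rust
use std::sync::{Condvar, Mutex};

/// Sketch of a staged signal: thread B calls `signal(n)` when it
/// reaches a known point, and thread A calls `wait_for(n)` to block
/// until then, making the interleaving deterministic.
#[derive(Default)]
struct Signal {
    stage: Mutex<usize>,
    cond: Condvar,
}

impl Signal {
    fn signal(&self, stage: usize) {
        let mut guard = self.stage.lock().unwrap();
        // Stages only move forward.
        if stage > *guard {
            *guard = stage;
            self.cond.notify_all();
        }
    }

    fn wait_for(&self, stage: usize) {
        let mut guard = self.stage.lock().unwrap();
        while *guard < stage {
            guard = self.cond.wait(guard).unwrap();
        }
    }
}
```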
Rather than checking the return value of `Q::cycle_fallback`, we now consult the computed recovery strategy to decide whether to panic or to recover. We can thus assume that we will successfully recover and don't need to check for `None` results anymore.
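Continuing the hypothetical `CycleRecoveryStrategy` sketch above, the decision point then reduces to a match; again a sketch, not the real salsa code:

```rust
// Because the strategy is known before recovery runs, recovery can be
// assumed to succeed and no `None` check is needed.
fn on_cycle_detected(strategy: CycleRecoveryStrategy) {
    match strategy {
        // Every participant opted into fallback: take the fallback value.
        CycleRecoveryStrategy::Fallback => { /* install recovery value */ }
        // Someone cannot recover: unwind like any other panic.
        CycleRecoveryStrategy::Panic => panic!("unexpected cycle"),
    }
}
```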
Before, we could not observe the case where:

* thread A is blocked on B
* a cycle is detected in thread B
* some participants are on thread A and have to be marked

(In particular, I commented out some code that seemed necessary and didn't see any tests fail.)
Instead of sending the result back, just have the waiting threads retry reading the cache.
Being generic over the keys made the code harder to read.
Make `add_edge` infallible
This will allow me to add condvar logic to it.
We are going to make it so that the runtime coordinates delivery of the WaitResults.
This is an intermediate step towards having the runtime coordinate wakeups.
Instead of creating a future for each edge in the graph, we now have all dependent queries block in the runtime and wake up whenever results are published to see if their results are ready. We could certainly allocate a CondVar for each dependent query if we found that spurious wakeups were a problem. I consider this highly unlikely in practice.
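A toy model of this scheme, with invented types (salsa's real runtime is more involved):

```rust
use std::sync::{Condvar, Mutex};

/// Sketch: instead of one future per dependency edge, every blocked
/// query waits on a shared condvar and re-probes the cache of
/// published results each time anything is published.
struct RuntimeSync<K: Eq, V: Clone> {
    published: Mutex<Vec<(K, V)>>,
    cond: Condvar,
}

impl<K: Eq, V: Clone> RuntimeSync<K, V> {
    fn publish(&self, key: K, value: V) {
        self.published.lock().unwrap().push((key, value));
        // Wake *all* waiters; each re-checks whether its key is ready.
        self.cond.notify_all();
    }

    fn block_on(&self, key: &K) -> V {
        let mut guard = self.published.lock().unwrap();
        loop {
            // A spurious or irrelevant wakeup just means we probe
            // again and go back to sleep if our key isn't ready yet.
            if let Some((_, v)) = guard.iter().find(|(k, _)| k == key) {
                return v.clone();
            }
            guard = self.cond.wait(guard).unwrap();
        }
    }
}
```

Because every waiter re-probes after each `notify_all`, a spurious wakeup costs only a wasted probe, which matches the trade-off described above.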
Currently, when one thread blocks on another, we clone the stack from that task. This results in a lot of clones, but it also means that we can't mark all the frames involved in a cycle atomically. Instead, when we propagate information between threads, we also propagate the participants of the cycle and so forth. This branch *moves* the stack into the runtime while a thread is blocked, and then moves it back out when the thread resumes. This permits the runtime to mark all the cycle participants at once. It also avoids cloning.
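The ownership dance might be modeled like this (all names are invented for illustration):

```rust
use std::mem;

// Invented types; salsa's real runtime and query stack are more involved.
struct Frame {
    query: &'static str,
    cycle_participant: bool,
}

struct LocalState {
    query_stack: Vec<Frame>,
}

struct Runtime;

impl Runtime {
    /// Holds the caller's stack while the thread is blocked, so the
    /// runtime can mark every cycle participant in one atomic step,
    /// then hands the stack back when the thread resumes.
    fn block_until_published(&self, stack: Vec<Frame>) -> Vec<Frame> {
        // ... park the thread; if a cycle is detected, mark all of
        // `stack`'s participating frames at once; then wake up ...
        stack
    }
}

fn block_on_other_thread(runtime: &Runtime, local: &mut LocalState) {
    // Move the stack into the runtime (no clone)...
    let stack = mem::take(&mut local.query_stack);
    // ...and move it back out when we resume.
    local.query_stack = runtime.block_until_published(stack);
}
```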
I was finding the parallel test setup hard to read, so everything relating to one test is now in a single file, with shorter names.
The previous version had a lot of spurious wakeups, but it also resisted the refactoring I'm about to do. =)
Including the corner case where the active thread does not have recovery.
@matklad ok, the code now supports cycle fallback so long as any one query has it. I also added tests for all the wicked situations I could come up with. The main thing that's not done is updating the RFC and Salsa book to describe how the algorithm works.
Although in the process of writing those comments I already found one bug (the fix for which is already part of these commits), so I wouldn't be surprised if doing more writing helped to flush out more bugs.
When a query Q invokes a cycle Q1...Q1 but Q is not a participant in that cycle, Q should not recover! Test that.
OK, the RFC is more up to date now.
Co-authored-by: Aleksey Kladov <[email protected]>
@matklad I believe I have addressed all of your review comments.
I discussed this with @matklad and we agreed to merge this PR. bors r+
Build succeeded:
This PR is... kind of long. It reshapes the core logic of salsa to fix various bugs in cycle handling and generally simplify how we handle cross-thread coordination.

Best read commit by commit: every commit passes all tests, afaik.

The core bug I was taking aim at was the fact that, when you invoke `maybe_changed_since`, you can sometimes wind up detecting a cycle without having pushed some of the relevant queries onto the stack. This is now fixed.

From a user's POV, not much changes with this PR; there are only minimal changes to the public interface. The biggest one is that recover functions now get a `&salsa::Cycle`, which has methods for accessing the participants; the other is that queries participating in cycle fallback use unwinding to avoid executing past the point where the cycle is discovered. Otherwise, things work the same as before:

* If all the queries participating in a cycle have `#[salsa::recover]`, then they take on the recovery value. (At the moment, they continue executing after the cycle is observed, but their final result is ignored; I plan to change this in a follow-up PR, or maybe some future commit to this PR.)
* If any query participating in a cycle does not have `#[salsa::recover]`, then the code panics. This is treated like any other panic, cancelling all other work.

Along the way, I made... a few... other changes:

* Cross-thread coordination now happens in the runtime, which removes the need for separate blocking logic in `maybe_changed_since` vs `read`, but it's also more compatible with some of the directions I have in mind.
* `maybe_changed_since` has been re-implemented in terms of the same core building blocks as `read` (`probe` and friends). I originally tried to unify them, but I realized that they behave somewhat differently from one another and both of them make sense. (In particular, we want to be able to free values with the LRU cache while still checking if they are up to date.)

Ah, I realize now that I had planned to write a bunch of docs in the salsa book before I landed this. Well, I'm going to open the PR anyway, as I've let this branch go far too long.
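The unwinding behavior described above can be modeled in plain Rust; this is only a sketch of the mechanism, with invented names, not salsa's actual implementation:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Marker payload thrown when a cycle is discovered mid-query
/// (invented type; salsa's real payload differs).
struct CycleDetected;

/// Run a query, but if a cycle unwinds it before it executes past the
/// discovery point, substitute the fallback value instead.
fn execute_with_fallback<T>(
    query: impl FnOnce() -> T,
    fallback: impl FnOnce() -> T,
) -> T {
    match panic::catch_unwind(AssertUnwindSafe(query)) {
        Ok(value) => value,
        // Cycle fallback: the query never ran past the cycle.
        Err(payload) if payload.is::<CycleDetected>() => fallback(),
        // Any other panic propagates, cancelling other work as usual.
        Err(payload) => panic::resume_unwind(payload),
    }
}
```

A query that discovers a cycle would then unwind with `std::panic::panic_any(CycleDetected)`, so no code after the discovery point executes.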
r? @matklad