Cleanup and fix cycle handling #285
If I am not mistaken, all current tests use "static" cycles -- a cycle that is always present. Let's spice that up by adding conditional cycles: cycles that appear only for specific impls.
This allows us to figure out whether a query can recover from a cycle (and how) without invoking the `recover` function.
Find the cycle recovery strategy for a given `DatabaseKey`.
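In sketch form, the idea might look like this (the names and shapes here are my assumptions for illustration, not the exact salsa internals):

```rust
/// Hypothetical sketch: cycle recovery becomes a property we can
/// compute per query, without ever calling a `recover` function.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
enum CycleRecoveryStrategy {
    /// No recovery: a cycle through this query panics.
    Panic,
    /// The query supplies a fallback value when it is in a cycle.
    Fallback,
}

/// Hypothetical lookup: given a database key, report how the
/// corresponding query recovers from cycles.
trait DatabaseKeyExt {
    fn cycle_recovery_strategy(&self) -> CycleRecoveryStrategy;
}
```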
The tests now use signals to guarantee we are testing the code paths we want to be testing.
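A minimal signal of this kind might look like the following sketch, assuming a plain `Mutex`/`Condvar` staging scheme (salsa's actual test helper may differ):

```rust
use std::sync::{Condvar, Mutex};

/// Sketch of a staged signal: thread B calls `signal(n)` when it
/// reaches a known point, and thread A calls `wait_for(n)` to block
/// until then, making the interleaving deterministic.
#[derive(Default)]
struct Signal {
    stage: Mutex<usize>,
    cond: Condvar,
}

impl Signal {
    fn signal(&self, stage: usize) {
        let mut guard = self.stage.lock().unwrap();
        // Stages only move forward.
        if stage > *guard {
            *guard = stage;
            self.cond.notify_all();
        }
    }

    fn wait_for(&self, stage: usize) {
        let mut guard = self.stage.lock().unwrap();
        while *guard < stage {
            guard = self.cond.wait(guard).unwrap();
        }
    }
}
```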
Rather than checking the return value of `Q::cycle_fallback`, we now consult the computed recovery strategy to decide whether to panic or to recover. We can thus assume that we will successfully recover and don't need to check for `None` results anymore.
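Continuing the hypothetical `CycleRecoveryStrategy` sketch above, the decision point then reduces to a match; again a sketch, not the real salsa code:

```rust
// Because the strategy is known before recovery runs, recovery can be
// assumed to succeed and no `None` check is needed.
fn on_cycle_detected(strategy: CycleRecoveryStrategy) {
    match strategy {
        // Every participant opted into fallback: take the fallback value.
        CycleRecoveryStrategy::Fallback => { /* install recovery value */ }
        // Someone cannot recover: unwind like any other panic.
        CycleRecoveryStrategy::Panic => panic!("unexpected cycle"),
    }
}
```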
Before, we could not observe the case where:

* thread A is blocked on B
* a cycle is detected in thread B
* some participants are on thread A and have to be marked

(In particular, I commented out some code that seemed necessary and didn't see any tests fail.)
Instead of sending the result back, just have the waiting threads retry reading the cache.
Being generic over the keys made the code harder to read.
Make `add_edge` infallible
This will allow me to add condvar logic to it.
We are going to make it so that the runtime coordinates delivery of the WaitResults.
This is an intermediate step towards having the runtime coordinate wakeups.
Instead of creating a future for each edge in the graph, we now have all dependent queries block in the runtime and wake up whenever results are published to see if their results are ready. We could certainly allocate a CondVar for each dependent query if we found that spurious wakeups were a problem. I consider this highly unlikely in practice.
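A toy model of this scheme, with invented types (salsa's real runtime is more involved):

```rust
use std::sync::{Condvar, Mutex};

/// Sketch: instead of one future per dependency edge, every blocked
/// query waits on a shared condvar and re-probes the cache of
/// published results each time anything is published.
struct RuntimeSync<K: Eq, V: Clone> {
    published: Mutex<Vec<(K, V)>>,
    cond: Condvar,
}

impl<K: Eq, V: Clone> RuntimeSync<K, V> {
    fn publish(&self, key: K, value: V) {
        self.published.lock().unwrap().push((key, value));
        // Wake *all* waiters; each re-checks whether its key is ready.
        self.cond.notify_all();
    }

    fn block_on(&self, key: &K) -> V {
        let mut guard = self.published.lock().unwrap();
        loop {
            // A spurious or irrelevant wakeup just means we probe
            // again and go back to sleep if our key isn't ready yet.
            if let Some((_, v)) = guard.iter().find(|(k, _)| k == key) {
                return v.clone();
            }
            guard = self.cond.wait(guard).unwrap();
        }
    }
}
```

Because every waiter re-probes after each `notify_all`, a spurious wakeup costs only a wasted probe, which matches the trade-off described above.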
Currently, when one thread blocks on another, we clone the stack from that task. This results in a lot of clones, but it also means that we can't mark all the frames involved in a cycle atomically. Instead, when we propagate information between threads, we also propagate the participants of the cycle and so forth. This branch *moves* the stack into the runtime while a thread is blocked, and then moves it back out when the thread resumes. This permits the runtime to mark all the cycle participants at once. It also avoids cloning.
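The ownership dance might be modeled like this (all names are invented for illustration):

```rust
use std::mem;

// Invented types; salsa's real runtime and query stack are more involved.
struct Frame {
    query: &'static str,
    cycle_participant: bool,
}

struct LocalState {
    query_stack: Vec<Frame>,
}

struct Runtime;

impl Runtime {
    /// Holds the caller's stack while the thread is blocked, so the
    /// runtime can mark every cycle participant in one atomic step,
    /// then hands the stack back when the thread resumes.
    fn block_until_published(&self, stack: Vec<Frame>) -> Vec<Frame> {
        // ... park the thread; if a cycle is detected, mark all of
        // `stack`'s participating frames at once; then wake up ...
        stack
    }
}

fn block_on_other_thread(runtime: &Runtime, local: &mut LocalState) {
    // Move the stack into the runtime (no clone)...
    let stack = mem::take(&mut local.query_stack);
    // ...and move it back out when we resume.
    local.query_stack = runtime.block_until_published(stack);
}
```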
I was finding the parallel test setup hard to read, so everything relating to one test is now in a single file, with shorter names.
The previous version had a lot of spurious wakeups, but it also resisted the refactoring I'm about to do. =)
Including the corner case where the active thread does not have recovery.
@matklad ok, the code now supports cycle fallback so long as any one query has it. I also added tests for all the wicked situations I could come up with. The main thing that's not done is updating the RFC and Salsa book to describe how the algorithm works.
Although in the process of writing those comments I already found one bug (the fix for which is already part of these commits), so I wouldn't be surprised if doing more writing helped to flush out more bugs.
When a query Q invokes a cycle Q1...Q1 but Q is not a participant in that cycle, Q should not recover! Test that.
OK, the RFC is more up to date now.
Co-authored-by: Aleksey Kladov <[email protected]>
@matklad I believe I have addressed all of your review comments.
I discussed this with @matklad and we agreed to merge this PR. bors r+
Build succeeded:
This PR is... kind of long. It reshapes the core logic of salsa to fix various bugs in cycle handling and generally simplify how we handle cross-thread coordination.

Best read commit by commit: every commit passes all tests, afaik.

The core bug I was taking aim at was the fact that, when you invoke `maybe_changed_since`, you can sometimes wind up detecting a cycle without having pushed some of the relevant queries onto the stack. This is now fixed.

From a user's POV, not much changes with this PR; there are only minimal changes to the public interface. The biggest one is that recover functions now get a `&salsa::Cycle`, which has methods for accessing the participants; the other is that queries participating in cycle fallback use unwinding to avoid executing past the point where the cycle is discovered. Otherwise, things work the same as before:

* If all the queries participating in a cycle have `#[salsa::recover]`, then they take on the recovery value. (At the moment, they continue executing after the cycle is observed, but their final result is ignored; I plan to change this in a follow-up PR, or maybe some future commit to this PR.)
* If any query participating in a cycle does not have `#[salsa::recover]`, then the code panics. This is treated like any other panic, cancelling all other work.

Along the way, I made... a few... other changes:

* Cross-thread coordination now happens in the runtime, which removes the need for separate blocking logic in `maybe_changed_since` vs `read`, but it's also more compatible with some of the directions I have in mind.
* `maybe_changed_since` has been re-implemented in terms of the same core building blocks as `read` (`probe` and friends). I originally tried to unify them, but I realized that they behave somewhat differently from one another and both of them make sense. (In particular, we want to be able to free values with the LRU cache while still checking if they are up to date.)

Ah, I realize now that I had planned to write a bunch of docs in the salsa book before I landed this. Well, I'm going to open the PR anyway, as I've let this branch go far too long.
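The unwinding behavior described above can be modeled in plain Rust; this is only a sketch of the mechanism, with invented names, not salsa's actual implementation:

```rust
use std::panic::{self, AssertUnwindSafe};

/// Marker payload thrown when a cycle is discovered mid-query
/// (invented type; salsa's real payload differs).
struct CycleDetected;

/// Run a query, but if a cycle unwinds it before it executes past the
/// discovery point, substitute the fallback value instead.
fn execute_with_fallback<T>(
    query: impl FnOnce() -> T,
    fallback: impl FnOnce() -> T,
) -> T {
    match panic::catch_unwind(AssertUnwindSafe(query)) {
        Ok(value) => value,
        // Cycle fallback: the query never ran past the cycle.
        Err(payload) if payload.is::<CycleDetected>() => fallback(),
        // Any other panic propagates, cancelling other work as usual.
        Err(payload) => panic::resume_unwind(payload),
    }
}
```

A query that discovers a cycle would then unwind with `std::panic::panic_any(CycleDetected)`, so no code after the discovery point executes.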
r? @matklad