Primary cache: support reentrancy #4980

Merged · 4 commits merged into main from cmc/cache_read_locks on Jan 31, 2024

Conversation

@teh-cmc (Member) commented on Jan 31, 2024:

This adds support for reentrancy in the latest-at and range caches, in order to handle a very nasty edge case where two rayon tasks (i.e. space views) that query the exact same data (e.g. because they are clones of each other) end up running concurrently on the same thread.
This can happen because we execute space views through multiple nested layers of parallel iterators, and because rayon's scheduler is a work-stealing one, this effectively means one thread can jump to anywhere in the code at any point, and might do so while holding a lock it shouldn't.
This becomes a problem now that querying data involves mutations and locks, due to the presence of a cache on the path.

There is a lot of complexity we could add on top of what this PR already does in order to make this edge case more efficient, but there is no reason to go there unless there is any indication that this is not good enough in practice (i.e. you don't even notice it's going on).
If it turns out that we can see glitches in practice, we'll go there.
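
For reference, one common way to tolerate this kind of reentrant access is to acquire the cache lock non-blockingly and fall back to an uncached query when the lock is unavailable. The following is a minimal, hypothetical sketch (not the actual code in this PR), assuming a `parking_lot::RwLock` around the cached results:

```rust
use std::collections::BTreeMap;

use parking_lot::RwLock;

/// Hypothetical sketch: a cache lookup that tolerates being re-entered on the
/// same thread. If the lock is already held further up the call stack, the
/// `try_*` calls fail and we answer the query without touching the cache,
/// instead of deadlocking.
struct QueryCache {
    results: RwLock<BTreeMap<u64, Vec<f32>>>, // query hash -> cached results
}

impl QueryCache {
    fn query(&self, key: u64, compute: impl Fn() -> Vec<f32>) -> Vec<f32> {
        // Fast path: the result is already cached (and the lock is free).
        if let Some(results) = self.results.try_read() {
            if let Some(cached) = results.get(&key) {
                return cached.clone();
            }
        }
        match self.results.try_write() {
            // We got the write lock: compute, cache, and return the result.
            Some(mut results) => results.entry(key).or_insert_with(&compute).clone(),
            // Reentrant (or merely contended) access: skip the cache and answer
            // the query directly. Slower, but it cannot deadlock.
            None => compute(),
        }
    }
}
```

The cost of the fallback path is re-running the query uncached, which matches the "correct but not necessarily efficient" trade-off described above.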

Taking a step back, it's important to realize that this is just another side-effect of our current "immediate mode querying model", where each space view computes its dataset on its own at the very last second while it is rendering, thereby mixing up computing the data with using the data, running identical queries multiple times, etc.
We already know that we want to --and have to-- move away from this model in order to make our upcoming features possible (on-disk data, component conversions, data overrides, external store hub, etc); so I'd rather not sink any more complexity than the bare minimum required into this thing -- it has to go away anyhow.

24-01-31_11.11.42.patched.mp4
24-01-31_11.10.38.patched.mp4

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG

@teh-cmc teh-cmc added labels 🔍 re_query (affects re_query itself), 💣 crash (crash, deadlock/freeze, do-no-start), and exclude from changelog (PRs with this won't show up in CHANGELOG.md) on Jan 31, 2024
@teh-cmc teh-cmc force-pushed the cmc/cache_read_locks branch from ef4d61c to 5cda35f on January 31, 2024 09:51
@teh-cmc teh-cmc marked this pull request as ready for review January 31, 2024 10:12
@teh-cmc teh-cmc requested review from emilk and Wumpf January 31, 2024 10:12
@Wumpf (Member) left a comment:

Great writeups, thank you for diving deep and noting everything down, in particular the tradeoffs on ranges.

I'm not sure how much I'd see the overall state of things as a hack, but you're right that in the future we'll need to do cache-filling operations in bulk ahead of time, if only to preserve sanity when dealing with promises; so it's just a matter of wording & taste ;-)

Similar, very nit-y note: technically a non-work-stealing job system can also run into this problem, if the work scheduled on the worker thread happens to clash with the task it paused. The problem is that the scheduler does work while waiting on a task join - it doesn't matter whether it steals that work item from another worker's queue or not, which is afaik all "work-stealing" means here. In a way the problem is more that rayon tries to imitate fibers here.
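
To make the failure mode concrete, here is a tiny, hypothetical repro sketch (not code from this PR) of how nested parallel iterators can make a single worker thread interleave two logically independent tasks; whether it actually interleaves on a given run depends on scheduling:

```rust
use rayon::prelude::*;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();
    pool.install(|| {
        (0..8u32).into_par_iter().for_each(|view| {
            let tid = std::thread::current().id();
            println!("enter view {view} on {tid:?}");
            // Nested parallelism: before this returns, the current thread may
            // pick up and run *other* `view` closures. Any non-reentrant lock
            // held across this call would still be held when they run.
            (0..1024u32).into_par_iter().for_each(|i| {
                std::hint::black_box(i);
            });
            println!("leave view {view} on {tid:?}");
        });
    });
}
```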

let iter_callback = |query: &RangeQuery, range_cache: &crate::RangeCache, f: &mut F| -> crate::Result<()> {
    re_tracing::profile_scope!("range", format!("{query:?}"));

    // We don't bother implementing the slow path here (busy write lock), as that would
Member:

We should still debug-log that we missed the lock, though.

teh-cmc (Member, Author):

That is way too spammy in practice (I tried) -- missing the lock happens quite literally all the time if you have a handful of views asking for the same data, as long as they run in parallel.

What's really important, though, is whether the lock was missed and the data wasn't cached by either the current thread or any other; but we cannot really know that for the range case unless we add some more complexity.
Though if it does happen in practice, we should notice it in the form of awful visual glitches pretty quickly.
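
A possible middle ground between per-miss logging and no visibility at all would be to count misses and report the total at most once per frame. A hypothetical sketch (none of these names or hooks exist in the codebase), using the `log` facade as a stand-in for whatever logging the codebase uses:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical: bump this wherever the write lock on the range cache is missed.
static RANGE_CACHE_LOCK_MISSES: AtomicU64 = AtomicU64::new(0);

fn on_lock_miss() {
    RANGE_CACHE_LOCK_MISSES.fetch_add(1, Ordering::Relaxed);
}

// Hypothetical: call this once per frame from wherever diagnostics are emitted.
fn report_lock_misses_for_frame() {
    let misses = RANGE_CACHE_LOCK_MISSES.swap(0, Ordering::Relaxed);
    if misses > 0 {
        log::debug!("range cache: {misses} lock misses this frame");
    }
}
```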

crates/re_query_cache/src/latest_at.rs (outdated, resolved)
Comment on lines +317 to +321
if query.range.min <= TimeInt::MIN {
    let mut reduced_query = query.clone();
    // This is the reduced query corresponding to the timeless part of the data.
    // It is inclusive and so it will yield `MIN..=MIN` = `[MIN]`.
    reduced_query.range.max = TimeInt::MIN; // inclusive
Member:

I'm a bit confused by this: why was it not needed before, and why can the query be malformed to begin with?

teh-cmc (Member, Author):

It was already there before, the diff is just misleading!
The query is not malformed; we just have to do... things... because of #4832 🙄:

Member:

Thanks for the link!

@teh-cmc teh-cmc force-pushed the cmc/cache_read_locks branch from 44652d2 to 5cda35f on January 31, 2024 11:06
@teh-cmc teh-cmc merged commit ea63c31 into main Jan 31, 2024
40 checks passed
@teh-cmc teh-cmc deleted the cmc/cache_read_locks branch January 31, 2024 14:34
Labels
💣 crash (crash, deadlock/freeze, do-no-start), exclude from changelog (PRs with this won't show up in CHANGELOG.md), 🔍 re_query (affects re_query itself)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock with primary caching is on
2 participants