Primary cache: support reentrancy #4980

Merged · 4 commits merged into main from cmc/cache_read_locks on Jan 31, 2024

Conversation

@teh-cmc (Member) commented on Jan 31, 2024:

This adds support for reentrancy in the latest-at and range caches, in order to handle a very nasty edge case where two rayon tasks (i.e. space views) that query the exact same data (e.g. because they are clones of each other) end up running concurrently on the same thread.
This can happen because we execute space views through multiple nested layers of parallel iterators, and because rayon's scheduler is a work-stealing one, this effectively means one thread can jump to anywhere in the code at any point, and might do so while holding a lock it shouldn't.
This becomes a problem now that querying data involves mutations and locks, due to the presence of a cache on the path.

There is a lot of complexity we could add on top of what this PR already does in order to make this edge case more efficient, but there is no reason to go there unless there is any indication that this is not good enough in practice (i.e. you don't even notice it's going on).
If it turns out that we can see glitches in practice, we'll go there.
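
For reference, one common way to tolerate this kind of reentrant access is to acquire the cache lock non-blockingly and fall back to an uncached query when the lock is unavailable. The following is a minimal, hypothetical sketch (not the actual code in this PR), assuming a `parking_lot::RwLock` around the cached results:

```rust
use std::collections::BTreeMap;

use parking_lot::RwLock;

/// Hypothetical sketch: a cache lookup that tolerates being re-entered on the
/// same thread. If the lock is already held further up the call stack, the
/// `try_*` calls fail and we answer the query without touching the cache,
/// instead of deadlocking.
struct QueryCache {
    results: RwLock<BTreeMap<u64, Vec<f32>>>, // query hash -> cached results
}

impl QueryCache {
    fn query(&self, key: u64, compute: impl Fn() -> Vec<f32>) -> Vec<f32> {
        // Fast path: the result is already cached (and the lock is free).
        if let Some(results) = self.results.try_read() {
            if let Some(cached) = results.get(&key) {
                return cached.clone();
            }
        }
        match self.results.try_write() {
            // We got the write lock: compute, cache, and return the result.
            Some(mut results) => results.entry(key).or_insert_with(&compute).clone(),
            // Reentrant (or merely contended) access: skip the cache and answer
            // the query directly. Slower, but it cannot deadlock.
            None => compute(),
        }
    }
}
```

The cost of the fallback path is re-running the query uncached, which matches the "correct but not necessarily efficient" trade-off described above.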

Taking a step back, it's important to realize that this is just another side-effect of our current "immediate mode querying model", where each space view computes its dataset on its own at the very last second while it is rendering, thereby mixing up computing the data with using the data, running identical queries multiple times, etc.
We already know that we want to --and have to-- move away from this model in order to make our upcoming features possible (on-disk data, component conversions, data overrides, external store hub, etc); so I'd rather not sink any more complexity than the bare minimum required into this thing -- it has to go away anyhow.

24-01-31_11.11.42.patched.mp4
24-01-31_11.10.38.patched.mp4

Checklist

  • I have read and agree to the Contributor Guide and the Code of Conduct
  • I've included a screenshot or gif (if applicable)
  • I have tested the web demo (if applicable):
  • The PR title and labels are set such as to maximize their usefulness for the next release's CHANGELOG

@teh-cmc teh-cmc added labels 🔍 re_query (affects re_query itself), 💣 crash (crash, deadlock/freeze, do-no-start), and exclude from changelog (PRs with this won't show up in CHANGELOG.md) on Jan 31, 2024
@teh-cmc teh-cmc force-pushed the cmc/cache_read_locks branch from ef4d61c to 5cda35f on January 31, 2024 09:51
@teh-cmc teh-cmc marked this pull request as ready for review January 31, 2024 10:12
@teh-cmc teh-cmc requested review from emilk and Wumpf January 31, 2024 10:12
@Wumpf (Member) left a comment:

Great writeups, thank you for diving deep and noting everything down, in particular the tradeoffs on ranges.

I'm not sure how much I'd see the overall state of things as a hack, but you're right that in the future we'll need to do cache-filling operations in bulk ahead of time, if only to preserve sanity when dealing with promises; so it's just a matter of wording & taste ;-)

Similar, very nit-y note: technically a non-work-stealing job system can also run into this problem, if the work scheduled on the worker thread happens to clash with the task it paused. The problem is that the scheduler does work while waiting on a task join - it doesn't matter whether it steals that work item from another worker's queue or not, which is afaik all "work-stealing" means here. In a way the problem is more that rayon tries to imitate fibers here.
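
To make the failure mode concrete, here is a tiny, hypothetical repro sketch (not code from this PR) of how nested parallel iterators can make a single worker thread interleave two logically independent tasks; whether it actually interleaves on a given run depends on scheduling:

```rust
use rayon::prelude::*;

fn main() {
    let pool = rayon::ThreadPoolBuilder::new().num_threads(2).build().unwrap();
    pool.install(|| {
        (0..8u32).into_par_iter().for_each(|view| {
            let tid = std::thread::current().id();
            println!("enter view {view} on {tid:?}");
            // Nested parallelism: before this returns, the current thread may
            // pick up and run *other* `view` closures. Any non-reentrant lock
            // held across this call would still be held when they run.
            (0..1024u32).into_par_iter().for_each(|i| {
                std::hint::black_box(i);
            });
            println!("leave view {view} on {tid:?}");
        });
    });
}
```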

let iter_callback = |query: &RangeQuery, range_cache: &crate::RangeCache, f: &mut F| -> crate::Result<()> {
    re_tracing::profile_scope!("range", format!("{query:?}"));

    // We don't bother implementing the slow path here (busy write lock), as that would
Member:

We should still debug-log that we missed the lock, though.

teh-cmc (Member, Author):

That is way too spammy in practice (I tried) -- missing the lock happens quite literally all the time if you have a handful of views asking for the same data, as long as they run in parallel.

What's really important, though, is whether the lock was missed and the data wasn't cached by either the current thread or any other; but we cannot really know that for the range case unless we add some more complexity.
Though if it does happen in practice, we should notice it in the form of awful visual glitches pretty quickly.
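
A possible middle ground between per-miss logging and no visibility at all would be to count misses and report the total at most once per frame. A hypothetical sketch (none of these names or hooks exist in the codebase), using the `log` facade as a stand-in for whatever logging the codebase uses:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical: bump this wherever the write lock on the range cache is missed.
static RANGE_CACHE_LOCK_MISSES: AtomicU64 = AtomicU64::new(0);

fn on_lock_miss() {
    RANGE_CACHE_LOCK_MISSES.fetch_add(1, Ordering::Relaxed);
}

// Hypothetical: call this once per frame from wherever diagnostics are emitted.
fn report_lock_misses_for_frame() {
    let misses = RANGE_CACHE_LOCK_MISSES.swap(0, Ordering::Relaxed);
    if misses > 0 {
        log::debug!("range cache: {misses} lock misses this frame");
    }
}
```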

crates/re_query_cache/src/latest_at.rs (outdated, resolved)
Comment on lines +317 to +321
if query.range.min <= TimeInt::MIN {
    let mut reduced_query = query.clone();
    // This is the reduced query corresponding to the timeless part of the data.
    // It is inclusive and so it will yield `MIN..=MIN` = `[MIN]`.
    reduced_query.range.max = TimeInt::MIN; // inclusive
Member:

I'm a bit confused by this: why was it not needed before, and why can the query be malformed to begin with?

teh-cmc (Member, Author):

It was already there before, the diff is just misleading!
The query is not malformed; we just have to do... things... because of #4832 🙄:

Member:

Thanks for the link!

@teh-cmc teh-cmc force-pushed the cmc/cache_read_locks branch from 44652d2 to 5cda35f on January 31, 2024 11:06
@teh-cmc teh-cmc merged commit ea63c31 into main Jan 31, 2024
40 checks passed
@teh-cmc teh-cmc deleted the cmc/cache_read_locks branch January 31, 2024 14:34
Labels
💣 crash (crash, deadlock/freeze, do-no-start), exclude from changelog (PRs with this won't show up in CHANGELOG.md), 🔍 re_query (affects re_query itself)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deadlock with primary caching is on
2 participants