Primary cache: support reentrancy #4980
Conversation
ef4d61c to 5cda35f
Great write-ups, thank you for diving deep and noting everything down, in particular the trade-offs on ranges.
I'm not sure how much I'd see the overall state of things as a hack, but you're right that in the future we'll need to do cache-filling operations in bulk ahead of time, if only to preserve sanity when dealing with promises -- so it's just a matter of wording & taste ;-)
Similar, very nitpicky note: technically a non-work-stealing job system can also run into this problem if the work scheduled on the worker thread happens to clash with the task it paused. The problem is that the scheduler does work while waiting on a task join - it doesn't matter whether it steals that work item from another worker's queue or not, which AFAIK is all that work-stealing means here. In a way the problem is more that rayon tries to imitate fibers here.
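To make that failure mode concrete, here is a minimal hypothetical sketch (not code from this PR, all names made up) of the "doing work while waiting on a task join" problem: a write lock is held across a rayon::join, and whatever pending task the scheduler decides to run on that thread while it waits may try to take the very same lock.

use std::sync::RwLock;

// Hypothetical shared cache, standing in for the primary cache.
static CACHE: RwLock<Vec<u64>> = RwLock::new(Vec::new());

fn fill_cache(value: u64) {
    // The current thread takes the write lock...
    let mut cache = CACHE.write().unwrap();

    // ...and then waits on a join while still holding it. While waiting, the
    // scheduler may run *any* pending task on this very thread -- stolen from
    // another worker's queue or not. If that task also calls `fill_cache`, it
    // blocks on `CACHE.write()` against a lock held by its own thread.
    let (a, b) = rayon::join(|| value * 2, || value * 3);
    cache.push(a + b);
}

fn main() {
    use rayon::prelude::*;
    // With enough parallel callers, one of them may end up re-entering
    // `fill_cache` on a thread that is already inside it; depending on how
    // the scheduler interleaves things, this can deadlock.
    (0..64u64).into_par_iter().for_each(fill_cache);
}

Whether the clashing task was stolen from another worker or came from the local queue does not change the outcome, which is the point being made above.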
let iter_callback = |query: &RangeQuery, range_cache: &crate::RangeCache, f: &mut F| -> crate::Result<()> {
    re_tracing::profile_scope!("range", format!("{query:?}"));

    // We don't bother implementing the slow path here (busy write lock), as that would
We should still debug-log that we missed the lock, though.
That is way too spammy in practice (I tried) -- missing the lock quite literally happens all the time if you have a handful of views asking for the same data, as long as they run in parallel.
What's really important is whether the lock was missed and the data ended up cached by neither the current thread nor any other; but we cannot really know that for the range case unless we add some more complexity.
Though if it does happen in practice, we should notice it pretty quickly in the form of awful visual glitches.
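For illustration, here is a hedged sketch of the try-write-or-fall-back pattern being discussed (hypothetical names and types, not the PR's actual implementation): when the write lock is busy -- possibly because this very thread already holds it further up the stack -- the query is answered without touching the cache, and the miss is only counted rather than logged each time.

use std::sync::{
    atomic::{AtomicU64, Ordering},
    RwLock,
};

// Hypothetical cache for a single range query; (time, value) pairs as stand-in data.
struct RangeCache {
    cached: RwLock<Vec<(i64, f32)>>,
}

static FALLBACK_COUNT: AtomicU64 = AtomicU64::new(0);

impl RangeCache {
    fn range(&self, times: std::ops::Range<i64>, fetch: impl Fn(i64) -> f32) -> Vec<(i64, f32)> {
        if let Ok(mut cached) = self.cached.try_write() {
            // Happy path: we own the cache, fill in any missing entries.
            for t in times.clone() {
                if !cached.iter().any(|(ct, _)| *ct == t) {
                    cached.push((t, fetch(t)));
                }
            }
            cached
                .iter()
                .filter(|(t, _)| times.contains(t))
                .copied()
                .collect()
        } else {
            // The lock is busy: maybe another view is filling it, or maybe we
            // are re-entering from the same thread. Answer uncached. Logging
            // every such miss is far too spammy in practice, so only count it.
            FALLBACK_COUNT.fetch_add(1, Ordering::Relaxed);
            times.map(|t| (t, fetch(t))).collect()
        }
    }
}

A counter (or a trace-level log behind a flag) would give some visibility into how often the fallback path is taken without drowning the log, though as noted above it still cannot tell whether some other thread ended up caching the data anyway.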
if query.range.min <= TimeInt::MIN {
    let mut reduced_query = query.clone();
    // This is the reduced query corresponding to the timeless part of the data.
    // It is inclusive and so it will yield `MIN..=MIN` = `[MIN]`.
    reduced_query.range.max = TimeInt::MIN; // inclusive
I'm a bit confused by this: why was it not needed before, and why can the query be malformed to begin with?
It was already there before, the diff is just misleading!
The query is not malformed; we just have to do... things... because of #4832 🙄:
Thanks for the link!
44652d2 to 5cda35f
This adds support for reentrancy in the latest-at and range caches in order to support a very nasty edge-case where two rayon tasks (i.e. space views) that query the same exact data (e.g. because they are clones of each other) end up running concurrently on the same thread.
This can happen because we execute space views through multiple nested layers of parallel iterators, and because rayon's scheduler is a work-stealing one. This effectively means one thread can jump to anywhere in the code at any point, and might do so while holding a lock it shouldn't.
This becomes a problem now that querying data involves mutations and locks, due to the presence of a cache on the path.
There is a lot of complexity we could add on top of what this PR already does in order to make this edge-case more efficient, but there is no reason to go there unless there is any indication that this is not good enough in practice (i.e. you don't even notice it's going on).
If it turns out that we can see glitches in practice, we'll go there.
Taking a step back, it's important to realize that this is just another side-effect of our current "immediate mode querying model", where each space view computes its dataset on its own at the very last second while it is rendering, therefore mixing up computing the data and using the data, running identical queries multiple times, etc.
We already know that we want to --and have to-- move away from this model in order to make our upcoming features possible (on-disk data, component conversions, data overrides, external store hub, etc); so I'd rather not sink any more complexity than the bare minimum required in this thing -- it has to go away anyhow.
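As a rough illustration of that execution model (hypothetical structure and names, not the actual viewer code): space views are driven through nested parallel iterators, so two views that are clones of each other issue the exact same queries, and the work-stealing scheduler is free to interleave them on a single thread.

use rayon::prelude::*;

// Hypothetical stand-in for a space view and its dataset.
#[derive(Clone)]
struct SpaceView {
    entities: Vec<String>,
}

fn query_cached(entity: &str) -> usize {
    // In the real viewer this would go through the primary cache and may take
    // a write lock to fill it; see the reentrancy discussion above.
    entity.len()
}

fn execute_views(views: &[SpaceView]) -> Vec<usize> {
    // Outer parallel iterator over space views...
    views
        .par_iter()
        .map(|view| {
            // ...with a nested parallel iterator over each view's entities.
            // While a worker waits on this inner join, the scheduler can hand
            // it part of *another* view's identical query.
            view.entities
                .par_iter()
                .map(|e| query_cached(e))
                .sum::<usize>()
        })
        .collect()
}

fn main() {
    let view = SpaceView {
        entities: vec!["camera".to_owned(), "points".to_owned()],
    };
    // Two clones of the same view -> the exact same queries, run in parallel.
    let sizes = execute_views(&[view.clone(), view]);
    println!("{sizes:?}");
}

In this sketch every query_cached call would hit the shared cache in the real viewer, which is where the reentrancy support added by this PR comes in.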
Demo videos: 24-01-31_11.11.42.patched.mp4, 24-01-31_11.10.38.patched.mp4