fix: reduce lock contention in RepartitionExec::execute
#10009
Conversation
The state is initialized ONCE for all partitions. However, this may take a short while (on a very busy system, 1 ms or more). It is quite likely that multiple threads call `execute` at the same time, because we have just fanned out to the number of "target partitions", which is typically set to the number of CPU cores, so all of them try to start executing the plan at the same time. The solution is to not waste CPU cycles in some futex lock but to tell the async runtime (i.e. tokio) that we are performing work, so the other threads can do something useful instead. This mostly just moves code around; no functional change intended.
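As a rough, self-contained sketch of that idea (not the actual DataFusion code; `SharedState`, `init_state`, and the standalone `execute` below are made-up names): sharing a `tokio::sync::OnceCell` across partitions lets the first caller run the potentially slow initialization while the other callers await cooperatively instead of blocking a runtime worker thread on a mutex.

```rust
// Minimal sketch (assumed names, not DataFusion code) of letting the async
// runtime handle "initialize once, many concurrent callers".
use std::sync::Arc;

use tokio::sync::OnceCell;
use tokio::task::JoinSet;

#[derive(Debug)]
struct SharedState {
    // Stand-in for the per-plan state that is built once for all partitions.
    channels: Vec<usize>,
}

async fn init_state(target_partitions: usize) -> SharedState {
    // Stand-in for the real, possibly millisecond-scale setup work.
    SharedState {
        channels: (0..target_partitions).collect(),
    }
}

async fn execute(partition: usize, state: Arc<OnceCell<SharedState>>) {
    // Every partition calls this; only the first future actually runs
    // init_state, the others await cooperatively until it completes.
    let state = state.get_or_init(|| init_state(8)).await;
    println!(
        "partition {partition} sees {} channels",
        state.channels.len()
    );
}

#[tokio::main]
async fn main() {
    let state = Arc::new(OnceCell::new());
    let mut tasks = JoinSet::new();
    for partition in 0..8 {
        tasks.spawn(execute(partition, Arc::clone(&state)));
    }
    while let Some(res) = tasks.join_next().await {
        res.unwrap();
    }
}
```

With a blocking `std::sync::Mutex` in the same spot, the waiting tasks would pin their worker threads until the initialization finished; with the async cell they simply yield back to tokio.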
})
    .collect();

// TODO: metric input-output mapping is broken
This was broken before: the first parameter was set to `partition`, i.e. the first partition that initializes the state. That's clearly wrong. I now initialize it to `0` and will fix the tracking properly in a follow-up PR.
Filed #10015 to track
let mut background_task = JoinSet::new();
background_task.spawn(async move {
    input.wait().await;
});
That goes for this and the other tests: there is NO guarantee that the inputs to `RepartitionExec` are polled if you never poll any of the outputs. Hence we need to move this barrier call into a background task so it happens at the same time as the output poll below.
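For illustration, here is a minimal stand-alone version of that test pattern (with a plain `tokio::sync::Barrier` standing in for the operator input, so not the actual test code): the barrier wait is moved into a background `JoinSet` task so it runs concurrently with the "output poll" in the test body. In this toy example, awaiting both waits inline from the test body would simply deadlock.

```rust
use std::sync::Arc;

use tokio::sync::Barrier;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    // Two parties: the "input side" (background task) and the test body.
    let barrier = Arc::new(Barrier::new(2));

    // Move the input-side wait into a background task so it makes progress
    // concurrently with the code below, mirroring the pattern in the tests.
    let input = Arc::clone(&barrier);
    let mut background_task = JoinSet::new();
    background_task.spawn(async move {
        input.wait().await;
    });

    // Stand-in for polling the output stream: only now do both sides of
    // the barrier rendezvous.
    barrier.wait().await;

    // Ensure the background task finished cleanly.
    while let Some(res) = background_task.join_next().await {
        res.unwrap();
    }
}
```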
/benchmark :)
Benchmark results: comparing cb21404 (main) and 01a4468 (PR)
Thank you @crepererum -- I reviewed this code carefully and it makes sense to me
I think it is important to document the rationale in comments to reduce the chance of this change being accidentally undone in the future, but otherwise I think it is ready to go 🚀
@@ -1240,7 +1294,10 @@ mod tests {
std::mem::drop(output_stream0);

// Now, start sending input
input.wait().await;
let mut background_task = JoinSet::new();
What is the purpose of this change? I tried this change without the other changes in this PR and the test still passes (I was expecting it would hang or something)
TBH a slightly slower runtime in a "repeat a single query in a hot loop" benchmark is somewhat expected. If you have enough tokio threads to spare, you can totally futex-wait. But for any kind of server software that's a bad idea.
Force-pushed from 669cd3e to df7a034
Co-authored-by: Andrew Lamb <[email protected]>
Force-pushed from df7a034 to a716e6c
Which issue does this PR close?
Closes #10014
Rationale for this change
The state is initialized ONCE for all partitions. However, this may take a short while (on a very busy system, 1 ms or more). It is quite likely that multiple threads call `execute` at the same time, because we have just fanned out to the number of "target partitions", which is typically set to the number of CPU cores, so all of them try to start executing the plan at the same time.

The solution is to not waste CPU cycles in some futex lock but to tell the async runtime (i.e. tokio) that we are performing work, so the other threads can do something useful instead.
What changes are included in this PR?
This mostly just moves code around, no functional change intended.
Are these changes tested?
Existing tests still pass.
Are there any user-facing changes?
Faster query execution.