
scx_bpfland: enhance performance consistency and predictability #410

Merged

merged 6 commits into main from bpfland-smooth-perf on Jul 4, 2024

Conversation

arighi
Contributor

@arighi arighi commented Jul 3, 2024

A set of changes to make performance more consistent and prevent spikes:

  • scx_bpfland: adjust default time slice to 5ms (default tuning for latency)
  • scx_bpfland: ensure task time slice never exceeds the slice_ns limit (latency improvement)
  • scx_bpfland: prevent starvation of regular tasks (prevent interactive tasks from monopolizing the CPUs)
  • scx_bpfland: unify dispatching kthreads with direct CPU dispatches (dispatch kthreads using per-CPU DSQs instead of SCX_DSQ_LOCAL)
  • scx_bpfland: drop built-in idle CPU selection logic (drop unused option / cleanup)

Thanks to the CachyOS community that provided an extensive amount of test results and feedback about these changes.

Andrea Righi added 2 commits July 3, 2024 08:54
Small refactoring of the idle CPU selection logic:
 - optimize idle CPU selection for tasks that can run on a single CPU
 - drop the built-in idle selection policy and completely rely on the
   custom one

Signed-off-by: Andrea Righi <[email protected]>
Always use direct CPU dispatch for kthreads, there is no need to treat
kthreads in a special way, simply reuse direct CPU dispatch to
prioritize them.

Moreover, change direct CPU dispatches to use scx_bpf_dispatch_vtime(),
since we may dispatch multiple tasks to the same per-CPU DSQ now.

Signed-off-by: Andrea Righi <[email protected]>
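The vtime-ordered per-CPU DSQ behavior described in this commit can be sketched as a plain-C simulation (a hypothetical illustration, not the scheduler's actual BPF code): tasks dispatched with `scx_bpf_dispatch_vtime()` are consumed in ascending virtual-time order, so a prioritized kthread with a lower vtime runs first even when it is enqueued after a regular task on the same per-CPU DSQ.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one per-CPU DSQ ordered by vtime,
 * mimicking scx_bpf_dispatch_vtime() semantics in plain C. */
struct task {
    uint64_t vtime;    /* virtual time: lower runs first */
    const char *name;
};

#define DSQ_CAP 8

struct dsq {
    struct task tasks[DSQ_CAP];
    int len;
};

/* Insert a task keeping the queue sorted by vtime (one insertion-sort step). */
static void dsq_dispatch_vtime(struct dsq *q, struct task t)
{
    int i = q->len++;
    while (i > 0 && q->tasks[i - 1].vtime > t.vtime) {
        q->tasks[i] = q->tasks[i - 1];
        i--;
    }
    q->tasks[i] = t;
}

/* Pop the task with the smallest vtime. */
static struct task dsq_consume(struct dsq *q)
{
    struct task t = q->tasks[0];
    memmove(&q->tasks[0], &q->tasks[1], (--q->len) * sizeof(t));
    return t;
}
```

This is why the commit switches direct CPU dispatches to the vtime variant: with multiple tasks sharing a per-CPU DSQ, plain FIFO dispatch would no longer guarantee that prioritized kthreads are consumed first.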
@sirlucjan
Contributor

I've been testing it for a few days. Everything works very well.

@ptr1337
Contributor

ptr1337 commented Jul 3, 2024

Also tested by me. Performance seems very good as well as reliability and interactivity.

struct bpf_cpumask *mask;
int ret = false;

bpf_rcu_read_lock();
Contributor


This isn't necessarily from this PR, but what is the RCU read lock protecting here?

Contributor Author


Hmm... I had the impression that the verifier was complaining without the RCU protection, but that's not the case (and I also don't see anything to protect here). It was probably needed to protect something else that was moved during some refactoring. Anyway, thanks for noticing; I'll drop the RCU read lock here!

* required to schedule a regular task. This ensures that regular tasks are not
* starved by interactive tasks.
*/
const volatile u64 starvation_thresh = 100;
Contributor


Down the line, it may lead to more consistent behavior if the mechanism is time-based rather than based on the number of events. Another thing to consider: tasks that are affine to some subset of CPUs might still be starved, so a mechanism that can transfer timed-out batch tasks to per-CPU DSQs would probably be more fool-proof. Just ideas for possible future improvements.

Contributor Author


Yes, a time-based approach would definitely be more effective. I'll work on that.

I also like the idea of transferring tasks to per-CPU DSQs. That's something I could also use for CPU hotplug support, to transfer tasks to another DSQ from the .cpu_offline() callback. However, I haven't found a way to pop tasks from a non-local DSQ and move them to another DSQ, except via scx_bpf_consume(); that doesn't give me the task, it just "internally" transfers it to the local DSQ, and of course I can't use consume in .cpu_offline()...

Andrea Righi added 3 commits July 4, 2024 10:24
Reduce the default time slice to 5ms for faster reaction and better
system responsiveness when the system is overcommitted.

This also helps to provide a more predictable level of performance.

Signed-off-by: Andrea Righi <[email protected]>
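The interplay between the 5ms default slice and the "never exceed the slice_ns limit" change from this series can be sketched in plain C. This is a hypothetical illustration under assumed names (`slice_ns`, `slice_ns_min`, `nr_waiting` are illustrative, not the scheduler's actual identifiers): the slice shrinks as more tasks contend, but is always clamped to the configured maximum.

```c
#include <stdint.h>

/* Illustrative defaults (hypothetical names, values from the PR text). */
static const uint64_t slice_ns     = 5000000ULL; /* 5ms default slice */
static const uint64_t slice_ns_min = 500000ULL;  /* assumed lower bound */

/* Scale the slice down with the number of waiting tasks, clamped so it
 * never exceeds slice_ns and never drops below slice_ns_min. */
static uint64_t task_slice(uint64_t nr_waiting)
{
    uint64_t slice = slice_ns / (nr_waiting + 1);

    if (slice < slice_ns_min)
        slice = slice_ns_min;
    if (slice > slice_ns) /* enforce the slice_ns upper limit */
        slice = slice_ns;
    return slice;
}
```

A shorter maximum slice means an overcommitted system rotates through runnable tasks more often, which is the "more predictable level of performance" the commit message refers to.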
There is no need to RCU protect the cpumask for the offline CPUs: it is
created once when the scheduler is initialized and it's never
deallocated.

Signed-off-by: Andrea Righi <[email protected]>
@arighi
Contributor Author

arighi commented Jul 4, 2024

Removed the unnecessary RCU read lock protection and implemented a better time-based starvation prevention. Tested with the usual gaming + kernel-build workload; everything looks good so far (apparently better than before).

Tasks are consumed from various DSQs in the following order:

  per-CPU DSQs => priority DSQ => shared DSQ

Tasks in the shared DSQ may be starved by those in the priority DSQ,
which in turn may be starved by tasks dispatched to any per-CPU DSQ.

To mitigate this, record the timestamp of the last task scheduling event
both from the priority DSQ and the shared DSQ.

If the starvation threshold is exceeded without consuming a task, the
scheduler will be forced to consume a task from the corresponding DSQ.

The starvation threshold can be adjusted using the --starvation-thresh
command line parameter (default is 5ms).

Signed-off-by: Andrea Righi <[email protected]>
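The time-based check described in this commit can be sketched as a small plain-C state machine (hypothetical sketch, not the actual BPF code): record the timestamp of the last consumption from a DSQ, and force a consumption from it once the configured threshold has elapsed without one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Default threshold from the commit message: 5ms. */
static const uint64_t starvation_thresh_ns = 5000000ULL;

/* One record per starvable DSQ (priority DSQ, shared DSQ). */
struct starvation_state {
    uint64_t last_consume_ns; /* timestamp of the last consumption */
};

/* True when the DSQ has gone unserviced longer than the threshold and
 * must be consumed from before the higher-priority queues. */
static bool dsq_starving(const struct starvation_state *s, uint64_t now_ns)
{
    return now_ns - s->last_consume_ns > starvation_thresh_ns;
}

/* Reset the clock whenever a task is consumed from the DSQ. */
static void dsq_mark_consumed(struct starvation_state *s, uint64_t now_ns)
{
    s->last_consume_ns = now_ns;
}
```

Unlike a count-based threshold, this bounds the worst-case wait of a shared-DSQ task in wall-clock time, which is what makes the behavior consistent regardless of how many interactive tasks are queued ahead of it.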
@arighi
Contributor Author

arighi commented Jul 4, 2024

Merging this as it seems pretty stable and well-tested.

@arighi arighi merged commit 86d2f50 into main Jul 4, 2024
1 check passed
@arighi arighi deleted the bpfland-smooth-perf branch July 4, 2024 19:37