
Make scx_rusty interactive #261

Merged: 3 commits merged into main on May 3, 2024

Conversation

@Byte-Lab (Contributor) commented on May 3, 2024

Overview

This is the first iteration that attempts to make scx_rusty more accommodating to interactive workloads. The patch set does the following:

  • Makes scx_rusty a deadline scheduler, rather than a simple vtime scheduler. The deadline is calculated according to a number of factors:
    • First and foremost, the deadline is determined by a task's average runtime. Unlike with EEVDF, which sets a task's deadline based on its slice length, scx_rusty tracks a task's average runtime, and scales its deadline inversely according to its weight.
    • scx_rusty also tracks the frequency with which a task is blocked, and the frequency with which a task wakes other tasks. In the former case, the task is likely to be a consumer, and in the latter case, a producer (or both if the frequency is high for both). We calculate a lat_prio value for tasks when setting a deadline, which is inversely scaled by a task's block and waker frequency, and positively scaled by a task's average runtime.

While we could almost certainly go a lot further with calculating tasks' lat_prio values, this already performs quite well. More on this below.
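
To make the deadline calculation above concrete, here is a hedged sketch in C. The struct fields, constants, and scaling factors are placeholders chosen for illustration; they are not the actual values or helpers in scx_rusty's main.bpf.c.

```c
/*
 * Hedged sketch of the deadline calculation described above; the struct,
 * constants, and scaling factors are illustrative placeholders, not the
 * actual scx_rusty (main.bpf.c) implementation.
 */
#include <stdint.h>

typedef uint64_t u64;

struct task_stats {
	u64 avg_runtime_ns;	/* smoothed runtime per stint on a CPU */
	u64 blocked_freq;	/* how often the task blocks (consumer signal) */
	u64 waker_freq;		/* how often the task wakes others (producer signal) */
	u64 weight;		/* nice-derived weight; 100 == default */
};

/*
 * Latency priority: lower means more latency sensitive. It rises with a
 * task's average runtime and falls with its block/waker frequencies.
 */
static u64 calc_lat_prio(const struct task_stats *ts)
{
	u64 prio = ts->avg_runtime_ns / 1000000;	/* runtime in ms */
	u64 interactivity = ts->blocked_freq + ts->waker_freq;

	return prio > interactivity ? prio - interactivity : 0;
}

/*
 * Deadline: offset now by the average runtime, stretched by lat_prio and
 * compressed by weight, so interactive and high-weight tasks are
 * dispatched sooner.
 */
static u64 calc_deadline_ns(u64 now_ns, const struct task_stats *ts)
{
	u64 weight = ts->weight ? ts->weight : 100;
	u64 offset = ts->avg_runtime_ns * (calc_lat_prio(ts) + 1);

	return now_ns + offset * 100 / weight;
}
```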

  • Updates scx_rusty to have a dynamic slice length. A longer slice length is used for under-utilized hosts, and a much shorter slice length is used for over-utilized hosts. These under/over util slice length values can be set when the scheduler is loaded, but are static thereafter. The scheduler will track utilization, and will adjust to use one or the other slice length depending on that util.

This is another example where we could almost certainly go further. For example, we could track slice length as a per-task construct, and e.g. throttle a task's slice when we determine that its average runtime is too high. For now, this also gives us very good results.
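
A minimal sketch of the two-slice scheme, assuming a tracked utilization percentage; the slice lengths and the over-utilization threshold below are placeholders, not rusty's defaults (the real values are configurable when the scheduler is loaded):

```c
/*
 * Minimal sketch of the two-slice scheme; the lengths and threshold are
 * placeholders, not rusty's defaults.
 */
#include <stdint.h>

typedef uint64_t u64;

#define SLICE_UNDERUTIL_NS	(20ULL * 1000 * 1000)	/* long slice: 20ms */
#define SLICE_OVERUTIL_NS	(1ULL * 1000 * 1000)	/* short slice: 1ms */

/* util_pct is the tracked system utilization in percent (0-100). */
static u64 pick_slice_ns(u64 util_pct)
{
	return util_pct >= 90 ? SLICE_OVERUTIL_NS : SLICE_UNDERUTIL_NS;
}
```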

Results

scx_rusty seems to perform remarkably well on common interactive workloads. I'll describe some of the benchmarks I ran below.

All benchmarks were run on a Ryzen 9 7950X. Each benchmark (unless stated otherwise) was run concurrently with an active Spotify session, as well as severe CPU contention via $ stress-ng -c $((4 * $(nproc))):

Terraria

Running terraria with scx_rusty under the above conditions results in roughly a 60-70% FPS improvement compared to EEVDF (on v6.8). See the following video for a demonstration: https://drive.google.com/file/d/1fyHt9BYGha6apl7HAkibwpy52UTi8-AQ/view?usp=sharing

Civilization 6

Running the standard, CPU-bound AI benchmark on Civilization 6 resulted in roughly a 2.5x improvement with scx_rusty over EEVDF:

EEVDF:

civ_6_eevdf

scx_rusty:

civ_6_rusty

If run without overcommitting the host, both schedulers appeared to perform equally.

kcompile

Finally, I also tested doing a kcompile while the system is severely overutilized:

$ make CC=clang -j allyesconfig
$ /usr/bin/time make CC=clang -j $(nproc) # <-- testing this

EEVDF:

35:37.42elapsed

scx_rusty:

35:19.05elapsed

This is only a sample size of one, so it isn't statistically significant, but it's at least indicative that this doesn't cause a regression.

Future work

Here are some ideas for how this patch set could be expanded upon in the future:

  1. Being more intentional and mathematical about how we calculate lat_prio from waker/block frequencies and avg_runtime. The values chosen were fairly arbitrary and anecdotal, so we should apply these signals in a more principled, mathematically sound way.
  2. Dynamically setting per-task slices, rather than choosing between just two global values.
  3. Making scx_rusty preemptive, to further help interactive tasks.

Let's remove the extraneous copy pasting and use a lookup helper like we
do for task and pcpu context.

Signed-off-by: David Vernet <[email protected]>
@Byte-Lab requested review from htejun and multics69 on May 3, 2024 at 02:00
@htejun (Contributor) left a comment

This overall looks fantastic. Thanks for the excellent work.

@arighi (Contributor) commented on May 3, 2024

> * Updates `scx_rusty` to have a dynamic slice length. A longer slice length is used for under-utilized hosts, and a much shorter slice length is used for over-utilized hosts. These under/over util slice length values can be set when the scheduler is loaded, but are static thereafter. The scheduler will track utilization, and will adjust to use one or the other slice length depending on that util.
>
> This is another example where we could almost certainly go further. For example, we could track slice length as a per-task construct, and e.g. throttle a task's slice when we determine that its average runtime is too high. For now, this also gives us very good results.

Given that scx_rusty has the concept of domains, it'd be interesting to see if there's any benefit using a per-domain slice, calculated as a function of each individual domain's load.
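
As a rough illustration of that suggestion (not something rusty currently implements), a per-domain slice could shrink as the domain's load approaches its capacity; the interpolation and minimum slice below are assumptions:

```c
/*
 * Hypothetical per-domain slice: scale a domain's slice down as its load
 * approaches its capacity. Not actual scx_rusty code.
 */
#include <stdint.h>

typedef uint64_t u64;

static u64 domain_slice_ns(u64 base_slice_ns, u64 dom_load, u64 dom_capacity)
{
	u64 min_slice_ns = base_slice_ns / 4;

	/* Fully loaded (or unknown capacity): fall back to the minimum slice. */
	if (!dom_capacity || dom_load >= dom_capacity)
		return min_slice_ns;

	/* Interpolate linearly between the base and minimum slice lengths. */
	return base_slice_ns -
	       (base_slice_ns - min_slice_ns) * dom_load / dom_capacity;
}
```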

@multics69 (Contributor) left a comment

Overall, it looks great to me. Thanks for the excellent work! I am glad that the high-level idea of LAVD was adopted, although, of course, the details are different. Also, I like the idea of considering the task's nice weight when calculating latency priority. I left a few comments asking for further clarification of high-level ideas behind some logic.

(Six resolved review threads on scheds/rust/scx_rusty/src/bpf/main.bpf.c; four refer to outdated diffs.)
@ptr1337 (Contributor) commented on May 3, 2024

Tested on a 7950X3D; this provides a massive improvement in interactivity. Benchmarks also look good compared to EEVDF.

scx_rusty doesn't do terribly well with interactive workloads. In order
to improve the situation, this patch adds support for basic deadline
scheduling in rusty. This approach doesn't incorporate eligibility, and
simply uses crude avg_runtime tracking to scale a task's deadline.

In a series of follow-on changes, we'll update the scheduler to use more
indicators for interactivity that affect both slice length, and deadline
calculation.

Signed-off-by: David Vernet <[email protected]>
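
A hedged sketch of what that "crude avg_runtime tracking" might look like: a simple exponentially weighted moving average updated whenever the task comes off a CPU. The decay factor and context struct are illustrative, not the actual rusty code.

```c
/*
 * Hypothetical sketch of crude avg_runtime tracking; the decay factor and
 * the context struct are illustrative, not the actual rusty code.
 */
#include <stdint.h>

typedef uint64_t u64;

struct task_ctx {
	u64 avg_runtime_ns;	/* smoothed runtime per stint on a CPU */
};

/* Call each time the task comes off a CPU, with the time it just ran. */
static void update_avg_runtime(struct task_ctx *tctx, u64 ran_ns)
{
	if (!tctx->avg_runtime_ns)
		tctx->avg_runtime_ns = ran_ns;
	else
		/* Weight history 3/4 and the new sample 1/4 (assumed weights). */
		tctx->avg_runtime_ns = (3 * tctx->avg_runtime_ns + ran_ns) / 4;
}
```
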
In user space in rusty, the tuner detects system utilization, and uses
it to inform how we do load balancing, our greedy / direct cpumasks,
etc. Something else we could be doing, but currently aren't, is using
system utilization to inform how we dispatch tasks. We currently have a
static, unchanging slice length for the runtime of the program, but no
single slice length is optimal for all scenarios.

Giving a task a long slice length does have advantages, such as
decreasing the number of involuntary context switches, decreasing the
overhead of preemption by doing it less frequently, possibly getting
better cache locality due to a task running on a CPU for a longer amount
of time, etc. On the other hand, long slices can be problematic as well.
When a system is highly utilized, a CPU-hogging task running for too
long can harm interactive tasks. When the system is under-utilized,
those interactive tasks can likely find an idle or under-utilized core
to run on. When the system is over-utilized, however, they're likely to
have to park in a runqueue.

Thus, in order to better accommodate such scenarios, this patch
implements a rudimentary slice scaling mechanism in scx_rusty. Rather
than having one global, static slice length, we instead have a dynamic,
global slice length that can be changed depending on system utilization.
When over-utilized, we go with a shorter slice length, and vice versa
when the system is under-utilized. With Terraria, this results in
roughly a 50% improvement in mean FPS when playing on an AMD Ryzen 9
7950X while running Spotify and stress-ng -c $((4 * $(nproc))).

Signed-off-by: David Vernet <[email protected]>
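
The sketch below shows, under assumption, how a userspace tuner might derive utilization from aggregate busy vs. total CPU time each tuning interval and choose which global slice to publish; the smoothing factor and 90% threshold are placeholders, not rusty's actual parameters (and rusty's tuner is written in Rust, not C).

```c
/*
 * Hypothetical userspace-side sketch, not the actual rusty tuner: derive
 * utilization from aggregate busy vs. total CPU time over an interval,
 * then pick which global slice to publish to the BPF side.
 */
#include <stdint.h>

typedef uint64_t u64;

#define SLICE_UNDERUTIL_NS	(20ULL * 1000 * 1000)	/* placeholder: 20ms */
#define SLICE_OVERUTIL_NS	(1ULL * 1000 * 1000)	/* placeholder: 1ms */

struct tuner_state {
	u64 prev_busy_ns;
	u64 prev_total_ns;
	double util;		/* smoothed utilization in [0.0, 1.0] */
};

/* Returns the slice length to use for the next interval. */
static u64 tuner_step(struct tuner_state *t, u64 busy_ns, u64 total_ns)
{
	u64 dbusy = busy_ns - t->prev_busy_ns;
	u64 dtotal = total_ns - t->prev_total_ns;
	double sample = dtotal ? (double)dbusy / (double)dtotal : 0.0;

	/* Smooth the sample so a momentary spike doesn't flip the slice. */
	t->util = 0.75 * t->util + 0.25 * sample;
	t->prev_busy_ns = busy_ns;
	t->prev_total_ns = total_ns;

	/* 0.9 is an assumed over-utilization threshold, not rusty's value. */
	return t->util >= 0.9 ? SLICE_OVERUTIL_NS : SLICE_UNDERUTIL_NS;
}
```
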
@Byte-Lab merged commit efb97de into main on May 3, 2024
1 check passed
@Byte-Lab deleted the rusty_interactive branch on May 3, 2024 at 19:42