
Make scx_rusty interactive #261

Merged: 3 commits merged into main on May 3, 2024

Conversation

@Byte-Lab (Contributor) commented on May 3, 2024

Overview

This is the first iteration that attempts to make scx_rusty more accommodating to interactive workloads. The patch set does the following:

  • Makes scx_rusty a deadline scheduler, rather than a simple vtime scheduler. The deadline is calculated according to a number of factors:
    • First and foremost, the deadline is determined by a task's average runtime. Unlike with EEVDF, which sets a task's deadline based on its slice length, scx_rusty tracks a task's average runtime, and scales its deadline inversely according to its weight.
    • scx_rusty also tracks the frequency with which a task is blocked, and the frequency with which a task wakes other tasks. In the former case, the task is likely to be a consumer, and in the latter case, a producer (or both if the frequency is high for both). We calculate a lat_prio value for tasks when setting a deadline, which is inversely scaled by a task's block and waker frequency, and positively scaled by a task's average runtime.

While we could almost certainly go a lot further with calculating tasks' lat_prio values, this already performs quite well. More on this below.
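
To make the deadline calculation above concrete, here is a hedged sketch in C. The struct fields, constants, and scaling factors are placeholders chosen for illustration; they are not the actual values or helpers in scx_rusty's main.bpf.c.

```c
/*
 * Hedged sketch of the deadline calculation described above; the struct,
 * constants, and scaling factors are illustrative placeholders, not the
 * actual scx_rusty (main.bpf.c) implementation.
 */
#include <stdint.h>

typedef uint64_t u64;

struct task_stats {
	u64 avg_runtime_ns;	/* smoothed runtime per stint on a CPU */
	u64 blocked_freq;	/* how often the task blocks (consumer signal) */
	u64 waker_freq;		/* how often the task wakes others (producer signal) */
	u64 weight;		/* nice-derived weight; 100 == default */
};

/*
 * Latency priority: lower means more latency sensitive. It rises with a
 * task's average runtime and falls with its block/waker frequencies.
 */
static u64 calc_lat_prio(const struct task_stats *ts)
{
	u64 prio = ts->avg_runtime_ns / 1000000;	/* runtime in ms */
	u64 interactivity = ts->blocked_freq + ts->waker_freq;

	return prio > interactivity ? prio - interactivity : 0;
}

/*
 * Deadline: offset now by the average runtime, stretched by lat_prio and
 * compressed by weight, so interactive and high-weight tasks are
 * dispatched sooner.
 */
static u64 calc_deadline_ns(u64 now_ns, const struct task_stats *ts)
{
	u64 weight = ts->weight ? ts->weight : 100;
	u64 offset = ts->avg_runtime_ns * (calc_lat_prio(ts) + 1);

	return now_ns + offset * 100 / weight;
}
```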

  • Updates scx_rusty to have a dynamic slice length. A longer slice length is used for under-utilized hosts, and a much shorter slice length is used for over-utilized hosts. These under/over util slice length values can be set when the scheduler is loaded, but are static thereafter. The scheduler will track utilization, and will adjust to use one or the other slice length depending on that util.

This is another example where we could almost certainly go further. For example, we could track slice length as a per-task construct, and e.g. throttle a task's slice when we determine that its average runtime is too high. For now, this also gives us very good results.
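
A minimal sketch of the two-slice scheme, assuming a tracked utilization percentage; the slice lengths and the over-utilization threshold below are placeholders, not rusty's defaults (the real values are configurable when the scheduler is loaded):

```c
/*
 * Minimal sketch of the two-slice scheme; the lengths and threshold are
 * placeholders, not rusty's defaults.
 */
#include <stdint.h>

typedef uint64_t u64;

#define SLICE_UNDERUTIL_NS	(20ULL * 1000 * 1000)	/* long slice: 20ms */
#define SLICE_OVERUTIL_NS	(1ULL * 1000 * 1000)	/* short slice: 1ms */

/* util_pct is the tracked system utilization in percent (0-100). */
static u64 pick_slice_ns(u64 util_pct)
{
	return util_pct >= 90 ? SLICE_OVERUTIL_NS : SLICE_UNDERUTIL_NS;
}
```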

Results

scx_rusty seems to perform remarkably well on common interactive workloads. I'll describe some of the benchmarks I ran below.

All benchmarks were run on a Ryzen 9 7950X. Each benchmark (unless stated otherwise) was run concurrently with an active Spotify session, as well as severe CPU contention via $ stress-ng -c $((4 * $(nproc))):

Terraria

Running terraria with scx_rusty under the above conditions results in roughly a 60-70% FPS improvement compared to EEVDF (on v6.8). See the following video for a demonstration: https://drive.google.com/file/d/1fyHt9BYGha6apl7HAkibwpy52UTi8-AQ/view?usp=sharing

Civilization 6

Running the standard, CPU-bound AI benchmark on Civilization 6 resulted in roughly a 2.5x improvement with scx_rusty over EEVDF:

EEVDF:

civ_6_eevdf

scx_rusty:

civ_6_rusty

If run without overcommitting the host, both schedulers appeared to perform equally.

kcompile

Finally, I also tested doing a kcompile while the system is severely overutilized:

$ make CC=clang -j allyesconfig
$ /usr/bin/time make CC=clang -j $(nproc) # <-- testing this

EEVDF:

35:37.42elapsed

scx_rusty:

35:19.05elapsed

This is only a sample size of one, so it isn't statistically significant, but it's at least indicative that this doesn't cause a regression.

Future work

Here are some ideas for how this patch set could be expanded upon in the future:

  1. Being more intentional and mathematical about how we calculate lat_prio from waker/block frequencies and avg_runtime. The values chosen were fairly arbitrary and anecdotal, so we should apply these signals in a more principled, mathematically sound way.
  2. Dynamically setting per-task slices, rather than choosing between just two global values.
  3. Making scx_rusty preemptive, to further help interactive tasks.

Let's remove the extraneous copy pasting and use a lookup helper like we
do for task and pcpu context.

Signed-off-by: David Vernet <[email protected]>
@Byte-Lab requested review from htejun and multics69 on May 3, 2024 at 02:00
@htejun (Contributor) left a comment

This overall looks fantastic. Thanks for the excellent work.

@arighi (Contributor) commented on May 3, 2024

> * Updates `scx_rusty` to have a dynamic slice length. A longer slice length is used for under-utilized hosts, and a much shorter slice length is used for over-utilized hosts. These under/over util slice length values can be set when the scheduler is loaded, but are static thereafter. The scheduler will track utilization, and will adjust to use one or the other slice length depending on that util.
>
> This is another example where we could almost certainly go further. For example, we could track slice length as a per-task construct, and e.g. throttle a task's slice when we determine that its average runtime is too high. For now, this also gives us very good results.

Given that scx_rusty has the concept of domains, it'd be interesting to see if there's any benefit using a per-domain slice, calculated as a function of each individual domain's load.
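
As a rough illustration of that suggestion (not something rusty currently implements), a per-domain slice could shrink as the domain's load approaches its capacity; the interpolation and minimum slice below are assumptions:

```c
/*
 * Hypothetical per-domain slice: scale a domain's slice down as its load
 * approaches its capacity. Not actual scx_rusty code.
 */
#include <stdint.h>

typedef uint64_t u64;

static u64 domain_slice_ns(u64 base_slice_ns, u64 dom_load, u64 dom_capacity)
{
	u64 min_slice_ns = base_slice_ns / 4;

	/* Fully loaded (or unknown capacity): fall back to the minimum slice. */
	if (!dom_capacity || dom_load >= dom_capacity)
		return min_slice_ns;

	/* Interpolate linearly between the base and minimum slice lengths. */
	return base_slice_ns -
	       (base_slice_ns - min_slice_ns) * dom_load / dom_capacity;
}
```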

@multics69 (Contributor) left a comment

Overall, it looks great to me. Thanks for the excellent work! I am glad that the high-level idea of LAVD was adopted, although, of course, the details are different. Also, I like the idea of considering the task's nice weight when calculating latency priority. I left a few comments asking for further clarification of high-level ideas behind some logic.

(Six resolved review threads on scheds/rust/scx_rusty/src/bpf/main.bpf.c; four refer to outdated diffs.)
@ptr1337 (Contributor) commented on May 3, 2024

Tested on a 7950X3D; this provides a massive improvement in interactivity. Benchmarks also look good compared to EEVDF.

scx_rusty doesn't do terribly well with interactive workloads. In order
to improve the situation, this patch adds support for basic deadline
scheduling in rusty. This approach doesn't incorporate eligibility, and
simply uses crude avg_runtime tracking to scale a task's deadline.

In a series of follow-on changes, we'll update the scheduler to use more
indicators for interactivity that affect both slice length, and deadline
calculation.

Signed-off-by: David Vernet <[email protected]>
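
A hedged sketch of what that "crude avg_runtime tracking" might look like: a simple exponentially weighted moving average updated whenever the task comes off a CPU. The decay factor and context struct are illustrative, not the actual rusty code.

```c
/*
 * Hypothetical sketch of crude avg_runtime tracking; the decay factor and
 * the context struct are illustrative, not the actual rusty code.
 */
#include <stdint.h>

typedef uint64_t u64;

struct task_ctx {
	u64 avg_runtime_ns;	/* smoothed runtime per stint on a CPU */
};

/* Call each time the task comes off a CPU, with the time it just ran. */
static void update_avg_runtime(struct task_ctx *tctx, u64 ran_ns)
{
	if (!tctx->avg_runtime_ns)
		tctx->avg_runtime_ns = ran_ns;
	else
		/* Weight history 3/4 and the new sample 1/4 (assumed weights). */
		tctx->avg_runtime_ns = (3 * tctx->avg_runtime_ns + ran_ns) / 4;
}
```
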
In user space in rusty, the tuner detects system utilization, and uses
it to inform how we do load balancing, our greedy / direct cpumasks,
etc. Something else we could be doing, but currently aren't, is using
system utilization to inform how we dispatch tasks. We currently have a
static, unchanging slice length for the runtime of the program, but no
single slice length is optimal for all scenarios.

Giving a task a long slice length does have advantages, such as
decreasing the number of involuntary context switches, decreasing the
overhead of preemption by doing it less frequently, possibly getting
better cache locality due to a task running on a CPU for a longer amount
of time, etc. On the other hand, long slices can be problematic as well.
When a system is highly utilized, a CPU-hogging task running for too
long can harm interactive tasks. When the system is under-utilized,
those interactive tasks can likely find an idle or under-utilized core
to run on. When the system is over-utilized, however, they're likely to
have to park in a runqueue.

Thus, in order to better accommodate such scenarios, this patch
implements a rudimentary slice scaling mechanism in scx_rusty. Rather
than having one global, static slice length, we instead have a dynamic,
global slice length that can be changed depending on system utilization.
When over-utilized, we go with a shorter slice length, and vice versa
when the system is under-utilized. With Terraria, this results in
roughly a 50% improvement in mean FPS when playing on an AMD Ryzen 9
7950X while running Spotify and stress-ng -c $((4 * $(nproc))).

Signed-off-by: David Vernet <[email protected]>
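
The sketch below shows, under assumption, how a userspace tuner might derive utilization from aggregate busy vs. total CPU time each tuning interval and choose which global slice to publish; the smoothing factor and 90% threshold are placeholders, not rusty's actual parameters (and rusty's tuner is written in Rust, not C).

```c
/*
 * Hypothetical userspace-side sketch, not the actual rusty tuner: derive
 * utilization from aggregate busy vs. total CPU time over an interval,
 * then pick which global slice to publish to the BPF side.
 */
#include <stdint.h>

typedef uint64_t u64;

#define SLICE_UNDERUTIL_NS	(20ULL * 1000 * 1000)	/* placeholder: 20ms */
#define SLICE_OVERUTIL_NS	(1ULL * 1000 * 1000)	/* placeholder: 1ms */

struct tuner_state {
	u64 prev_busy_ns;
	u64 prev_total_ns;
	double util;		/* smoothed utilization in [0.0, 1.0] */
};

/* Returns the slice length to use for the next interval. */
static u64 tuner_step(struct tuner_state *t, u64 busy_ns, u64 total_ns)
{
	u64 dbusy = busy_ns - t->prev_busy_ns;
	u64 dtotal = total_ns - t->prev_total_ns;
	double sample = dtotal ? (double)dbusy / (double)dtotal : 0.0;

	/* Smooth the sample so a momentary spike doesn't flip the slice. */
	t->util = 0.75 * t->util + 0.25 * sample;
	t->prev_busy_ns = busy_ns;
	t->prev_total_ns = total_ns;

	/* 0.9 is an assumed over-utilization threshold, not rusty's value. */
	return t->util >= 0.9 ? SLICE_OVERUTIL_NS : SLICE_UNDERUTIL_NS;
}
```
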
@Byte-Lab merged commit efb97de into main on May 3, 2024
1 check passed
@Byte-Lab deleted the rusty_interactive branch on May 3, 2024 at 19:42