Add CPU frequency related helpers and extend scx_layered #239
Conversation
```c
@@ -560,30 +560,42 @@ void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)

	/* consume preempting layers first */
	bpf_for(idx, 0, nr_layers)
		if (layers[idx].preempt && scx_bpf_consume(idx))
```
I'm not sure if `layered_dispatch` is the right place to put this, so please let me know if there is a more suitable place. I probably could have cleaned up some of the redundant checks as well.
I think we'll want to put this in `layered_running()`. That callback is invoked just before a task starts running on a CPU, so we can look at what layer it's in and apply the frequency tuning accordingly.

As an aside, `layered_dispatch()` isn't the only place where we can choose where a task will run. `layered_select_cpu()`, which is called on the task wakeup (and task fork) path, decides which CPU a task should be migrated to before being enqueued. We can "directly dispatch" to local DSQs from there, which is conceptually the same thing as calling `scx_bpf_consume()`. See `scx_bpf_dispatch(p, SCX_DSQ_LOCAL, ...)`. In the former case, we're putting a task directly on the local DSQ that the core kernel will pull the task from to run, and in the latter case, we're taking a task from another DSQ and putting it on the local DSQ. The end result is the same: we're specifying a task that should run next on that CPU. By instead just looking up a task's layer from `layered_running()`, we're guaranteed to hit all the cases.

If we want to speed up the context switch path, we could also reset the frequency in `layered_stopping()`, but that could have the downside of accidentally throttling a CPU that's running a high-pri layer task.
```c
	if (layer->perf > 0)
		scx_bpf_cpuperf_set(cpu, layer->perf);
```
So in `lookup_layer_cpumask()`, we invoke `scx_bpf_error()` if we fail to look up the layer's cpumask. `scx_bpf_error()` causes the main kernel to boot out the scheduler, so in general it's not useful to do any extra work on error paths, such as setting the frequency here.
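The caller-side pattern being described might look like the following sketch. Everything here is stubbed for illustration (the names `lookup_layer_cpumask`, `apply_layer`, and the two-entry mask table are assumptions, not the actual scx_layered code): once `scx_bpf_error()` has fired, the scheduler is about to be unloaded, so the caller just bails instead of continuing on to the cpuperf update.

```c
#include <assert.h>
#include <stddef.h>

/* Stubs for illustration; in the real code these are BPF map lookups
 * and the sched_ext error helper. */
static int error_raised;

static void scx_bpf_error(const char *msg)   /* stub */
{
	(void)msg;
	error_raised = 1;
}

struct cpumask { int dummy; };
static struct cpumask layer_masks[2];

static struct cpumask *lookup_layer_cpumask(int idx)
{
	if (idx < 0 || idx >= 2) {
		/* Scheduler is about to be booted out; do no further work. */
		scx_bpf_error("bad layer idx");
		return NULL;
	}
	return &layer_masks[idx];
}

/* Caller pattern: bail on the error path rather than, e.g., setting
 * the CPU frequency anyway. */
static int apply_layer(int idx, int cpu)
{
	struct cpumask *mask = lookup_layer_cpumask(idx);

	if (!mask)
		return -1;   /* error path: skip the cpuperf update */
	(void)cpu;
	/* ... success path would apply the perf setting here ... */
	return 0;
}
```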
```rust
    #[serde(default)]
    perf: u64,
```
So this is certainly one option for how to expose this knob to configurations. Another option would be to expose it as something like QoS, and have the scheduler implementation abstract how that's eventually applied with `scx_bpf_cpuperf_set()`. A few things to consider:

- Say we have a layer that has very relaxed runtime requirements, and one with very sensitive latency requirements. If both layers are occupying CPUs on the system, it may be desirable to run the low-pri layer at a low frequency if the goal is to allow the high-pri layer running on other cores to run hotter. On the other hand, if the high-pri layer isn't running at that moment, we'd ideally like to run the lower-pri layer at a higher frequency, given that it won't affect the high-pri layer. If we directly passed perf through the config like this, a low-pri layer would always run with low perf even if there's nothing else on the system for it to contend with. Maybe that's OK (or possibly even desirable given how predictable the behavior would be) for a configuration-driven scheduler which really isn't meant to be a general-purpose scheduling solution, but it also does feel a bit suboptimal.
- I might be getting a bit ahead of myself here, but if we ever want to have layered run on asym chips like ARM big.LITTLE, just specifying perf on its own wouldn't be sufficient. Maybe having QoS would be, and the scheduler could internally use `scx_bpf_cpuperf_set()`, `scx_bpf_cpuperf_cap()`, and `scx_bpf_cpuperf_cur()` to find the right CPU and frequency for it? Not sure, maybe we'd need new configs.
- I expect it would be difficult for a user to know what value to choose for perf. It seems like what a user really wants to say is, "this layer is super important and should be boosted". Maybe it would even be as simple as keeping a cpumask that has bits set for every CPU running a high-pri task. If the weight of that cpumask is nonzero, then we'd know that we (may) need to throttle any non-high-pri tasks.
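The "high-pri cpumask" idea in the last point can be sketched roughly as follows. This is only an illustration of the proposal, not scx_layered code: a plain 64-bit mask stands in for a BPF cpumask, and all names (`mark_highpri`, `lowpri_perf_target`) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the proposal: track which CPUs are currently running
 * high-priority tasks in a bitmask (a stand-in for a BPF cpumask),
 * and only throttle low-pri tasks while that mask is non-empty. */

static uint64_t highpri_cpus;   /* bit N set => CPU N runs a high-pri task */

static void mark_highpri(int cpu, int running)
{
	if (running)
		highpri_cpus |= 1ULL << cpu;
	else
		highpri_cpus &= ~(1ULL << cpu);
}

/* Pick the perf target for a low-pri task: throttle only when a
 * high-pri layer is actually occupying CPUs somewhere (mask weight
 * is nonzero); otherwise let it run at full perf. */
static uint64_t lowpri_perf_target(uint64_t low_perf, uint64_t max_perf)
{
	return highpri_cpus ? low_perf : max_perf;
}
```

This captures the dynamic behavior discussed in the first point too: a low-pri layer is only held to a low frequency while a high-pri layer is actually present.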
What do you think?
I think it'd be better to expose the raw number. Providing an interface with higher-level semantics requires a lot more knowledge than is currently available in the stack. We just don't have enough context to determine what a semantic QoS level should map to on given hardware (e.g. the desktop processors I'm testing have only three or so perf levels, they aren't evenly spaced, and I have no idea what the perf/power ratio is at each point).

Also, given how scx_layered is for hand-crafted boutique setups, abstraction here can be more of a hurdle than a help.
Yeah, I mostly went with the raw number out of pure laziness, to be able to quickly test things without thinking about the API. At least with the uclamp API there are buckets, which could be interesting to think about. Maybe another higher-level abstraction would help with that; I'm not sure.

My only other thought is: what is a reasonable default? I was thinking that if the (default) value is 0, we don't apply any frequency scaling and let the frequency governor/driver handle scaling.
This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and `scx_bpf_cpuperf_set` definitions that were recently introduced into [`sched_ext`](sched-ext/sched_ext#180). It adds a `perf` field to `scx_layered` to allow for controlling performance per layer. Signed-off-by: Daniel Hodges <[email protected]>
Force-pushed from `7ed200b` to `32e97bf`.
Should we add logic to account for running on kernels that don't have

Edit: Working on this now