
Add CPU frequency related helpers and extend scx_layered #239

Merged (1 commit) on Apr 24, 2024

Conversation

hodgesds
Contributor

This change adds scx_bpf_cpuperf_cap, scx_bpf_cpuperf_cur and scx_bpf_cpuperf_set definitions that were recently introduced into sched_ext. It adds a perf field to scx_layered to allow for controlling performance per layer.

@@ -560,30 +560,42 @@ void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)

/* consume preempting layers first */
bpf_for(idx, 0, nr_layers)
if (layers[idx].preempt && scx_bpf_consume(idx))
Contributor Author
I'm not sure if layered_dispatch is the right place to put this, so please let me know if there's a more suitable one. I probably could have cleaned up some of the redundant checks as well.

Contributor

I think we'll want to put this in layered_running(). That callback is invoked just before a task starts running on a CPU, so we can look at what layer it's in and apply the frequency tuning accordingly.

As an aside, layered_dispatch() isn't the only place where we can choose where a task will run. layered_select_cpu(), which is called on the task wakeup (and task fork) path, decides which CPU a task should be migrated to before it's enqueued. We can "directly dispatch" to local DSQs from there, which is conceptually the same thing as calling scx_bpf_consume(). See scx_bpf_dispatch(p, SCX_DSQ_LOCAL, ...). In the former case, we're putting a task directly on the local DSQ where the core kernel will pull the task from to run, and in the latter case, we're taking a task from another DSQ and putting it on the local DSQ. The end result is the same: we're specifying a task that should run next on that CPU. By instead just looking up a task's layer from layered_running(), we're guaranteed to hit all the cases.

If we want to speed up the context switch path, we could also reset the frequency in layered_stopping(), but that could have the downside of accidentally throttling a CPU that's running a high-pri layer task.
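To illustrate the suggestion above, here's a rough, untested sketch of what applying the per-layer perf value from layered_running() might look like. The helper names (lookup_task_ctx, the layers array, the perf field) follow the conventions used elsewhere in scx_layered, but the exact shape is an assumption, not the merged implementation:

```c
/* Sketch only: apply per-layer cpuperf from layered_running(), which is
 * invoked just before a task starts running on a CPU. */
void BPF_STRUCT_OPS(layered_running, struct task_struct *p)
{
	struct task_ctx *tctx;
	struct layer *layer;
	s32 cpu = scx_bpf_task_cpu(p);

	if (!(tctx = lookup_task_ctx(p)))
		return;
	layer = &layers[tctx->layer];

	/* perf == 0 means "leave the CPU to the cpufreq governor" */
	if (layer->perf > 0)
		scx_bpf_cpuperf_set(cpu, layer->perf);
}
```

Since this runs on every context switch into a task, any cost here is paid on the hot path, which is why the follow-up about layered_stopping() matters.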

@hodgesds
Contributor Author

Test setup:

$ cat /proc/cpuinfo | grep 'model name' | head -n 1
model name      : Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor | uniq -c 
     80 schedutil

Config:

[
  {
    "name": "batch",
    "comment": "tasks under system.slice or tasks with nice value > 0",
    "matches": [
      [
        {
          "CgroupPrefix": "system.slice/"
        }
      ],
      [
        {
          "NiceAbove": 0
        }
      ]
    ],
    "kind": {
      "Confined": {
        "util_range": [
          0.01,
          1.0
        ],
        "cpus_range": [
          0,
          80
        ],
        "min_exec_us": 1000,
        "perf": 128
      }
    }
  },
  {
    "name": "normal",
    "comment": "the rest",
    "matches": [
      []
    ],
    "kind": {
      "Open": {
        "min_exec_us": 100,
        "preempt": true,
        "exclusive": true,
        "perf": 128
      }
    }
  }
]

The image shows it was able to clock down the frequency effectively.

@hodgesds changed the title from "Adds CPU frequency related helpers and extend scx_layered" to "Add CPU frequency related helpers and extend scx_layered" on Apr 24, 2024
Comment on lines 576 to 577
if (layer->perf > 0)
scx_bpf_cpuperf_set(cpu, layer->perf);
Contributor

So in lookup_layer_cpumask(), we invoke scx_bpf_error() if we fail to look up the layer's cpumask. scx_bpf_error() will cause the main kernel to boot out the scheduler, so it's generally not useful to do any extra work on error paths, such as setting the frequency here.
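In other words, the error path should bail out rather than fall through to the frequency call. A minimal sketch of the pattern being suggested (the surrounding variable names are assumptions for illustration):

```c
/* Sketch: return early once scx_bpf_error() has fired inside the lookup;
 * the scheduler is being ejected, so further work is wasted. */
struct bpf_cpumask *layer_cpumask;

if (!(layer_cpumask = lookup_layer_cpumask(idx)))
	return;	/* lookup_layer_cpumask() already called scx_bpf_error() */

if (layer->perf > 0)
	scx_bpf_cpuperf_set(cpu, layer->perf);
```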

Comment on lines +315 to +316
#[serde(default)]
perf: u64,
Contributor

So this is certainly one option for how to expose this knob to configurations. Another option would be to expose this as something like QoS, and have the scheduler implementation abstract how that's eventually applied with scx_bpf_cpuperf_set(). A few things to consider:

  1. Say we have a layer that has very relaxed runtime requirements, and one with very sensitive latency requirements. If both layers are occupying CPUs on the system, it may be desirable to run the low-pri layer at a low frequency if the goal is to allow the high-pri layer running on other cores to run hotter. On the other hand, if the high-pri layer isn't running at that moment, we'd ideally like to run the lower-pri layer at a higher frequency given that it won't affect the high-pri layer. If we directly passed perf through the config like this, it would cause a low-pri layer to always run with low perf even if there's nothing else on the system for it to contend with. Maybe that's OK (or possibly even desirable given how predictable the behavior would be) for a configuration-driven scheduler which really isn't meant to be a general purpose scheduling solution, but it also does feel a bit suboptimal.

  2. I might be getting a bit ahead of myself here, but if we ever want to have layered run on asym chips like ARM big.LITTLE, just specifying perf on its own wouldn't be sufficient. Maybe having QoS would be, and the scheduler could internally use scx_bpf_cpuperf_set(), scx_bpf_cpuperf_cap(), and scx_bpf_cpuperf_cur() to find the right CPU and frequency for it? Not sure, maybe we'd need new configs.

  3. I expect it would be difficult for a user to know what value to choose for perf. It seems like what a user really wants to say is, "this layer is super important and should be boosted". Maybe it would even be as simple as keeping a cpumask that has bits set for every CPU running a high-pri task. If the weight of that cpumask is nonzero, then we'd know that we (may) need to throttle any non-high-pri tasks.

What do you think?
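For concreteness, the QoS idea in point 3 could look something like the following. This is purely hypothetical (the enum, the level-to-perf mapping, and qos_to_perf are invented for illustration); only the scx_bpf_cpuperf_cap() kfunc comes from the PR itself:

```c
/* Hypothetical sketch: derive a perf target from a coarse QoS level
 * instead of exposing the raw 0-1024 value in the config. */
enum layer_qos { QOS_RELAXED, QOS_NORMAL, QOS_BOOSTED };

static inline u32 qos_to_perf(enum layer_qos qos, s32 cpu)
{
	u32 cap = scx_bpf_cpuperf_cap(cpu);	/* max perf of this CPU */

	switch (qos) {
	case QOS_RELAXED:
		return cap / 4;		/* run cool, leave headroom */
	case QOS_NORMAL:
		return 0;		/* 0: defer to the cpufreq governor */
	case QOS_BOOSTED:
		return cap;		/* run at this CPU's max perf */
	}
	return 0;
}
```

On an asymmetric system, basing the mapping on each CPU's cap (rather than a fixed constant) is what would let the same QoS level mean something sensible on both big and little cores.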

Contributor

I think it'd be better to expose the raw number. Providing an interface with higher-level semantics requires a lot more knowledge than is currently available in the stack. We just don't have enough context to determine what a semantic QoS level should map to on given hardware (e.g. the desktop processors I'm testing have only three or so perf levels, they aren't evenly spaced, and I have no idea what the perf/power ratio is at each point).

Also, given that scx_layered is for hand-crafted boutique setups, abstraction here can be more of a hurdle than a help.

Contributor Author

Yeah, I mostly went with the raw number out of laziness, so I could quickly test things without thinking about the API. At least with the uclamp API there are buckets, which could be interesting to think about. Maybe another higher-level abstraction would help with that; I'm not sure.

My only other thought: what is a reasonable default? I was thinking that if the (default) value is 0, we don't apply any frequency scaling and let the frequency governor/driver handle it.


This change adds `scx_bpf_cpuperf_cap`, `scx_bpf_cpuperf_cur` and
`scx_bpf_cpuperf_set` definitions that were recently introduced into
[`sched_ext`](sched-ext/sched_ext#180). It adds
a `perf` field to `scx_layered` to allow for controlling performance per
layer.

Signed-off-by: Daniel Hodges <[email protected]>
@htejun htejun merged commit 9a9b4dd into sched-ext:main Apr 24, 2024
@Byte-Lab
Contributor

Byte-Lab commented Apr 24, 2024

Should we add logic to account for running on kernels that don't have scx_bpf_cpuperf_set()?
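One common way to handle this in BPF schedulers is to declare the kfunc as a weak ksym and check for it at runtime with bpf_ksym_exists() from bpf_helpers.h, so the program still loads and runs on kernels that predate the kfunc. A sketch of that approach (the cpuperf_set_compat wrapper name is invented here; it is not necessarily what the follow-up implemented):

```c
/* Sketch: weak ksym declaration lets the verifier accept the program
 * even when the kernel doesn't export scx_bpf_cpuperf_set(). */
void scx_bpf_cpuperf_set(s32 cpu, u32 perf) __ksym __weak;

static inline void cpuperf_set_compat(s32 cpu, u32 perf)
{
	/* bpf_ksym_exists() is true only if the kernel resolved the ksym */
	if (bpf_ksym_exists(scx_bpf_cpuperf_set))
		scx_bpf_cpuperf_set(cpu, perf);
}
```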

Edit: Working on this now

@hodgesds hodgesds deleted the cpufreq_helpers branch May 1, 2024 22:41
3 participants