
scx_layered: Add topology awareness for NUMA nodes and LLCs #446

Merged: 3 commits into sched-ext:main on Jul 24, 2024

Conversation

hodgesds (Contributor):

Add support for NUMA node and LLC layer configuration in scx_layered. If the nodes and llcs fields are unset, all CPUs are used. If both fields have values, their cpusets are OR'ed together. This is useful for running scx_layered on NUMA machines, where setting a soft affinity to a NUMA node or LLC may be beneficial.
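The unset/OR'ed behavior can be sketched roughly as follows. This is a hypothetical C sketch, not the actual scx_layered implementation: node_cpus() and layer_allowed_cpus() are illustrative names, and the cpumask is truncated to a single 64-bit word covering the first 40 CPUs of the test machine described below (SMT siblings 40-79 omitted).

```c
#include <stdint.h>

/* Example cpumask for a NUMA node, loosely modeled on the 2-node test
 * machine: node0 = CPUs 0-19, node1 = CPUs 20-39 (SMT siblings omitted
 * so everything fits in 64 bits). Illustrative only. */
static uint64_t node_cpus(int node)
{
	uint64_t mask = 0;
	int base = node ? 20 : 0;

	for (int i = 0; i < 20; i++)
		mask |= 1ULL << (base + i);
	return mask;
}

/* Unset nodes and llcs means all CPUs; otherwise OR the selected cpusets
 * together. The test machine has one LLC per node, so node_cpus() doubles
 * as the per-LLC mask here. */
uint64_t layer_allowed_cpus(const int *nodes, int nr_nodes,
			    const int *llcs, int nr_llcs)
{
	uint64_t mask = 0;

	if (!nr_nodes && !nr_llcs)
		return ~0ULL;
	for (int i = 0; i < nr_nodes; i++)
		mask |= node_cpus(nodes[i]);
	for (int i = 0; i < nr_llcs; i++)
		mask |= node_cpus(llcs[i]);
	return mask;
}
```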

Testing for NUMA nodes using this config:

[{
  "name":"numa_1",
  "comment":"numa1",
  "matches":[
     [
             {"CommPrefix":"stress-ng"}
     ],[
             {"PcommPrefix":"stress-ng"}
     ]],
  "kind": {
    "Confined":{
      "util_range": [0.04, 0.50],
      "min_exec_us":50,
      "nodes":[1]
     }
  }
},{
  "name":"numa_XXXX",
  "comment":"numa 1, noop layer",
  "matches":[
     [{"CommPrefix":"not-a-comm"}]
  ],
  "kind": {
    "Confined":{
      "util_range": [0.01, 0.20],
      "min_exec_us": 0,
      "nodes":[1]
     }
  }
},
{
  "name":"numa_0",
  "comment":"the rest",
  "matches":[[]],
  "kind":{
    "Confined": {
      "util_range": [0.05, 0.80],
      "min_exec_us": 0,
      "nodes":[0]
    }
  }
}]

CPU configuration:

$ lscpu  | grep -i numa
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-19,40-59
NUMA node1 CPU(s):                    20-39,60-79

Running stress-ng with 20 CPUs puts load on NUMA node 1:

[screenshot: CPU load concentrated on NUMA node 1's CPUs]

Similar test with LLCs:

[{
  "name":"llc_1",
  "comment":"llc1",
  "matches":[
     [
             {"CommPrefix":"stress-ng"}
     ],[
             {"PcommPrefix":"stress-ng"}
     ]],
  "kind": {
    "Confined":{
      "util_range": [0.04, 0.50],
      "min_exec_us":50,
      "llcs":[1]
     }
  }
},{
  "name":"llc_XXXX",
  "comment":"llc 1, noop layer",
  "matches":[
     [{"CommPrefix":"not-a-task"}]
  ],
  "kind": {
    "Confined":{
      "util_range": [0.01, 0.20],
      "min_exec_us":50,
      "llcs":[1]
     }
  }
},
{
  "name":"llc_0",
  "comment":"the rest",
  "matches":[[]],
  "kind":{
    "Confined": {
      "util_range": [0.05, 0.80],
      "min_exec_us":0,
      "llcs":[0]
    }
  }
}]

Running stress-ng shows similar results for NUMA pinning as this machine only has a single LLC per NUMA node.
[screenshot: CPU load concentrated on LLC 1's CPUs]

Add a cpus method to various subfields in the topology struct to easily get
the map of CPUs for nodes/LLCs.

Signed-off-by: Daniel Hodges <[email protected]>
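The cpus accessor idea can be sketched roughly as follows. This is a hypothetical, simplified C stand-in (the real topology struct lives in the Rust-side scx_utils crate), so struct and function names here are illustrative.

```c
#include <string.h>

#define MAX_CPUS 16

/* Hypothetical stand-in for a topology subfield (a node or an LLC) that
 * records which CPU ids belong to it. */
struct topo_unit {
	int nr_cpus;
	int cpu_ids[MAX_CPUS];
};

/* "cpus" accessor sketch: mark every CPU id owned by the unit in
 * `present`, giving callers a simple per-node/per-LLC CPU map. */
static void topo_unit_cpus(const struct topo_unit *unit,
			   unsigned char present[MAX_CPUS])
{
	memset(present, 0, MAX_CPUS);
	for (int i = 0; i < unit->nr_cpus; i++) {
		int cpu = unit->cpu_ids[i];

		if (cpu >= 0 && cpu < MAX_CPUS)
			present[cpu] = 1;
	}
}
```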
Add NUMA node topology awareness for scx_layered. This borrows some of
the NUMA handling from scx_rusty and allows layers to set a node mask.
Different layer kinds will use the node mask differently.

Signed-off-by: Daniel Hodges <[email protected]>
@@ -86,6 +88,18 @@ struct cpu_ctx {
u64 ran_current_for;
};

struct cache_ctx {
hodgesds (Contributor Author):
I might not be using this, so I think I can delete this.

return 0;
}

static s32 create_node(u32 node_id)
hodgesds (Contributor Author):
This was mostly hijacked from scx_rusty.

@htejun (Contributor) left a comment:

Looks good as the first step. That said, the overhead of looping in the dispatch path may be noticeable under contention, and the resulting behavior might not be ideal (e.g. DSQs for lower-numbered LLCs are always favored). Given that there are a lot more changes needed around the dispatch path and load balancing, I think iterating in-tree makes sense, so please feel free to land it and keep building on top.

// return the dsq id for the layer based on the LLC id.
static inline u64 layer_dsq_id(u32 layer_id, u32 llc_id)
{
return (layer_id*nr_llcs) + llc_id;
Contributor:
Can you format it like (layer_id * nr_llcs) + llc_id?
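For illustration, the layer-major DSQ id layout above (each layer owning nr_llcs consecutive ids) and its inverse can be sketched in plain C. nr_llcs is fixed here for the example, and the inverse helpers are hypothetical additions, not from the diff.

```c
/* Illustrative sketch of the DSQ id layout: layer-major, LLC-minor. */
static const unsigned int nr_llcs = 2;

static inline unsigned long long layer_dsq_id(unsigned int layer_id,
					      unsigned int llc_id)
{
	return ((unsigned long long)layer_id * nr_llcs) + llc_id;
}

/* Hypothetical inverse mappings, handy when walking every DSQ with one
 * flat index instead of a nested loop. */
static inline unsigned int dsq_to_layer(unsigned long long dsq_id)
{
	return dsq_id / nr_llcs;
}

static inline unsigned int dsq_to_llc(unsigned long long dsq_id)
{
	return dsq_id % nr_llcs;
}
```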

@@ -649,7 +695,9 @@ void BPF_STRUCT_OPS(layered_enqueue, struct task_struct *p, u64 enq_flags)
goto find_cpu;
}

scx_bpf_dispatch_vtime(p, tctx->layer, slice_ns, vtime, enq_flags);
u32 llc_id = cpu_to_llc_id(tctx->last_cpu >= 0 ? tctx->last_cpu: 0);
Contributor:
Ditto, please keep a space between operators and tokens.

if (layers[idx].preempt && scx_bpf_consume(idx))
return;
bpf_for(idx, 0, nr_layers) {
bpf_for(llc_id, 0, nr_llcs) {
Contributor:
I suppose there's no automatic locality implemented in this PR?

Contributor Author:
No, but that should hopefully be an easy win.
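One possible shape for that win, sketched as plain userspace C rather than verifier-ready BPF: consume() below is a stub standing in for scx_bpf_consume(), and dispatch_local_first() is a hypothetical helper, not code from this PR.

```c
#include <stdbool.h>

#define NR_LAYERS 2
#define NR_LLCS   2

/* Stub DSQ state standing in for the kernel-side queues; in the real
 * scheduler scx_bpf_consume() would attempt the dequeue. */
static bool dsq_has_tasks[NR_LAYERS * NR_LLCS];

static bool consume(unsigned int dsq_id)
{
	if (dsq_has_tasks[dsq_id]) {
		dsq_has_tasks[dsq_id] = false;
		return true;
	}
	return false;
}

static unsigned int dsq_id(unsigned int layer, unsigned int llc)
{
	return (layer * NR_LLCS) + llc;
}

/* Automatic-locality sketch: try every layer's DSQ for the dispatching
 * CPU's own LLC first, then fall back to remote LLCs. Returns the
 * consumed DSQ id, or -1 if all DSQs were empty. */
static int dispatch_local_first(unsigned int local_llc)
{
	for (unsigned int layer = 0; layer < NR_LAYERS; layer++)
		if (consume(dsq_id(layer, local_llc)))
			return dsq_id(layer, local_llc);
	for (unsigned int layer = 0; layer < NR_LAYERS; layer++)
		for (unsigned int llc = 0; llc < NR_LLCS; llc++)
			if (llc != local_llc && consume(dsq_id(layer, llc)))
				return dsq_id(layer, llc);
	return -1;
}
```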

bpf_for(idx, 0, nr_layers) {
bpf_for(llc_id, 0, nr_llcs) {
dsq_id = layer_dsq_id(idx, llc_id);
if (layers[idx].preempt && scx_bpf_consume(dsq_id))
Contributor:
As the number of DSQs can be pretty high, down the line we probably need to make the walk more efficient. But let's revisit that later, as this part of the code would need substantial updates to support automatic locality anyway.

Contributor Author:
Yeah, I don't like these double nested for loops.

@hodgesds hodgesds merged commit 4042fc4 into sched-ext:main Jul 24, 2024
1 check passed
@hodgesds hodgesds deleted the layered-topo branch July 24, 2024 22:08