layered: Fix static configuration, and dispatch for Grouped layers #148
Conversation
Byte-Lab
commented
Feb 21, 2024
- For static configurations of confined or grouped layers, we weren't properly propagating them to the kernel.
- We weren't trying to consume from the proper layer for grouped layers in layered_dispatch
We currently have a bug in layered wherein we can fail to propagate layer updates from user space to kernel space if a layer is never adjusted after it is first initialized. For example, take the following configuration:

[
  {
    "name": "workload.slice",
    "comment": "main workload slice",
    "matches": [ [ { "CgroupPrefix": "workload.slice/" } ] ],
    "kind": {
      "Grouped": {
        "cpus_range": [30, 30],
        "util_range": [0.0, 1.0],
        "preempt": false
      }
    }
  },
  {
    "name": "normal",
    "comment": "the rest",
    "matches": [ [] ],
    "kind": {
      "Grouped": {
        "cpus_range": [2, 2],
        "util_range": [0.0, 1.0],
        "preempt": false
      }
    }
  }
]

Both layers are static and need only be resized a single time, so the configuration is never propagated to the kernel because we never call update_bpf_layer_cpumask(). Let's instead have initialization propagate the changes to the skeleton before we attach the scheduler. This both fixes the bug above, where a static configuration is never propagated to the kernel, and removes the short window right after the scheduler is attached during which we make suboptimal scheduling decisions because the layer sizing has not yet been propagated.

Signed-off-by: David Vernet <[email protected]>
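(Illustration only: scx_layered's user space is written in Rust, but the ordering the fix establishes maps onto the standard BPF skeleton lifecycle and can be sketched in libbpf-style C. The layered_bpf skeleton, its nr_layers field, and set_layer_cpumask() below are hypothetical stand-ins, not the real scx_layered code.)

#include <bpf/libbpf.h>
#include "layered.skel.h"	/* hypothetical skeleton header */

/* Hypothetical helper: writes layer idx's computed CPU bitmap into BPF memory. */
void set_layer_cpumask(struct layered_bpf *skel, __u32 idx);

int main(void)
{
	struct layered_bpf *skel;
	__u32 i;
	int ret = 1;

	skel = layered_bpf__open();
	if (!skel)
		return 1;
	if (layered_bpf__load(skel))
		goto out;

	/* Propagate every layer's cpumask -- including static layers that
	 * will never be resized at runtime -- *before* attaching, so the
	 * scheduler never runs on a stale mask.
	 */
	for (i = 0; i < skel->rodata->nr_layers; i++)
		set_layer_cpumask(skel, i);

	if (layered_bpf__attach(skel))
		goto out;

	ret = 0;
	/* ... run the scheduler until told to exit ... */
out:
	layered_bpf__destroy(skel);
	return ret;
}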
In layered_init, we're currently setting all bits in every layer's cpumask, and then asynchronously updating the cpumasks at a later time to reflect their actual values at runtime. Now that we're updating the layered code to initialize the cpumasks before we attach the scheduler, we can instead have the init path refresh and initialize the cpumasks directly.

Signed-off-by: David Vernet <[email protected]>
Currently, in layered_dispatch, we do the following:

1. Iterate over all preempt=true layers, and first try to consume from them.
2. Iterate over all confined layers, and consume from them if we find a layer with a cpumask that contains the consuming CPU.
3. Iterate over all grouped and open layers and consume from them in ordered sequence.

In (2), we're only iterating over confined layers, but we should also be iterating over grouped layers. Otherwise, despite a consuming CPU being allocated to a specific grouped layer, the CPU will consume from whichever grouped or open layer has a task that's ready to run.

Signed-off-by: David Vernet <[email protected]>
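(A rough sketch of the consumption order described above, with the fix applied in step 2. The layer kinds, struct layer fields, layers[] array, layer_dsq_id(), and layer_cpu_is_set() are simplified stand-ins for the real scx_layered definitions, and the usual scx BPF includes are assumed.)

/* Simplified stand-ins for the real scx_layered definitions. */
enum layer_kind { LAYER_KIND_OPEN, LAYER_KIND_GROUPED, LAYER_KIND_CONFINED };

struct layer {
	enum layer_kind kind;
	bool preempt;
};

extern struct layer layers[];
extern const volatile u32 nr_layers;
u64 layer_dsq_id(u32 idx);			/* DSQ backing layer idx */
bool layer_cpu_is_set(u32 idx, s32 cpu);	/* is cpu in layer idx's cpumask? */

void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)
{
	int idx;

	/* (1) Preempting layers get first shot at the CPU. */
	bpf_for(idx, 0, nr_layers) {
		if (layers[idx].preempt && scx_bpf_consume(layer_dsq_id(idx)))
			return;
	}

	/* (2) Layers that own this CPU: confined *and* grouped layers whose
	 * cpumask contains the consuming CPU (previously only confined
	 * layers were considered here).
	 */
	bpf_for(idx, 0, nr_layers) {
		struct layer *layer = &layers[idx];

		if ((layer->kind == LAYER_KIND_CONFINED ||
		     layer->kind == LAYER_KIND_GROUPED) &&
		    layer_cpu_is_set(idx, cpu) &&
		    scx_bpf_consume(layer_dsq_id(idx)))
			return;
	}

	/* (3) Otherwise consume from grouped and open layers in order. */
	bpf_for(idx, 0, nr_layers) {
		struct layer *layer = &layers[idx];

		if ((layer->kind == LAYER_KIND_GROUPED ||
		     layer->kind == LAYER_KIND_OPEN) &&
		    scx_bpf_consume(layer_dsq_id(idx)))
			return;
	}
}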
		return;

	cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);

	bpf_rcu_read_lock();
What's this protecting?
Without the rcu region, the verifier doesn't like the cpumask kfunc arg:
; bpf_for(cpu, 0, nr_possible_cpus) {
441: (3e) if w1 >= w2 goto pc+23 ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R2_w=32 refs=9,28
; if (!cpumaskw || !cpumaskw->cpumask) {
442: (15) if r6 == 0x0 goto pc+46 ; frame1: R6=map_value(off=0,ks=4,vs=8,imm=0) refs=9,28
; if (!cpumaskw || !cpumaskw->cpumask) {
443: (79) r2 = *(u64 *)(r6 +0) ; frame1: R2_w=untrusted_ptr_or_null_bpf_cpumask(id=30,off=0,imm=0) R6=map_value(off=0,ks=4,vs=8,imm=0) refs=9,28
; if (!cpumaskw || !cpumaskw->cpumask) {
444: (15) if r2 == 0x0 goto pc+44 ; frame1: R2_w=untrusted_ptr_bpf_cpumask(off=0,imm=0) refs=9,28
;
445: (bc) w3 = w1 ; frame1: R1_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R3_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) refs=9,28
446: (74) w3 >>= 3 ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=3,var_off=(0x0; 0x3)) refs=9,28
447: (bf) r4 = r7 ; frame1: R4_w=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055280,var_off=(0x0; 0x183f78)) R7=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055280,var_off=(0x0; 0x183f78)) refs=9,28
448: (0f) r4 += r3 ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=3,var_off=(0x0; 0x3)) R4_w=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055283,var_off=(0x0; 0x183f7b)) refs=9,28
449: (07) r4 += 527572 ; frame1: R4_w=map_value(off=527572,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055283,var_off=(0x0; 0x183f7b)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
450: (bc) w3 = w1 ; frame1: R1_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R3_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) refs=9,28
451: (54) w3 &= 7 ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
452: (71) r4 = *(u8 *)(r4 +0) ; frame1: R4_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
453: (7c) w4 >>= w3 ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) R4_w=scalar() refs=9,28
454: (54) w4 &= 1 ; frame1: R4=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=9,28
455: (16) if w4 == 0x0 goto pc+7 ; frame1: R4=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=9,28
; bpf_cpumask_set_cpu(cpu, cpumaskw->cpumask);
456: (85) call bpf_cpumask_set_cpu#36671
R2 must be a rcu pointer
processed 237 insns (limit 1000000) max_states_per_insn 1 total_states 24 peak_states 24 mark_read 9
-- END PROG LOAD LOG --
Otherwise, cpumaskw->cpumask could be freed out from under it on the ops.init() path. I'm kinda confused though why it's safe to pass cpumaskw->cpumask to the kfunc directly. I'd have expected us to need to put that into a local variable.
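(For reference, a minimal sketch of the guarded pattern under discussion, with bpf_rcu_read_lock()/bpf_rcu_read_unlock() bracketing both the read of cpumaskw->cpumask and the cpumask kfunc calls. MAX_LAYERS, nr_possible_cpus, and the map/struct layout are stand-ins rather than the exact scx_layered definitions, and the usual vmlinux.h plus common scx BPF includes are assumed.)

#define MAX_LAYERS 16				/* stand-in */

const volatile u32 nr_possible_cpus = 1;	/* set by user space before load */

struct cpumask_wrapper {
	struct bpf_cpumask __kptr *cpumask;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_LAYERS);
	__type(key, u32);
	__type(value, struct cpumask_wrapper);
} layer_cpumasks SEC(".maps");

/* bits points at the layer's CPU bitmap that user space wrote into BPF memory. */
static void refresh_layer_cpumask(u32 idx, const u8 *bits)
{
	struct cpumask_wrapper *cpumaskw;
	int cpu;

	cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);
	if (!cpumaskw)
		return;

	/* Explicit RCU read section: it keeps cpumaskw->cpumask alive while
	 * it is passed to the cpumask kfuncs.  Without it the verifier
	 * rejects the call with "R2 must be a rcu pointer".
	 */
	bpf_rcu_read_lock();
	if (!cpumaskw->cpumask) {
		bpf_rcu_read_unlock();
		return;
	}

	bpf_for(cpu, 0, nr_possible_cpus) {
		if (bits[cpu / 8] & (1 << (cpu % 8)))
			bpf_cpumask_set_cpu(cpu, cpumaskw->cpumask);
		else
			bpf_cpumask_clear_cpu(cpu, cpumaskw->cpumask);
	}
	bpf_rcu_read_unlock();
}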
I see. The existing code is lacking the needed protection then. Hmm... why was it even loading?
I believe it's because it was previously only being called on paths with interrupts disabled. ops.init() is sleepable, so it explicitly needs the extra RCU protection.