
layered: Fix static configuration, and dispatch for Grouped layers #148

Merged 3 commits into main on Feb 21, 2024

Conversation

Byte-Lab (Contributor)

  • For static configurations of confined or grouped layers, we weren't properly propagating them to the kernel.
  • In layered_dispatch, we weren't trying to consume from the proper layer for grouped layers.

We currently have a bug in layered wherein we could fail to propagate
layer updates from user space to kernel space if a layer is never
adjusted after it's first initialized. For example, in the following
configuration:

[
	{
		"name": "workload.slice",
		"comment": "main workload slice",
		"matches": [
			[
				{
					"CgroupPrefix": "workload.slice/"
				}
			]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [30, 30],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	},
	{
		"name": "normal",
		"comment": "the rest",
		"matches": [
			[]
		],
		"kind": {
			"Grouped": {
				"cpus_range": [2, 2],
				"util_range": [
					0.0,
					1.0
				],
				"preempt": false
			}
		}
	}
]

Both layers are static, and need only be resized a single time, so the
configuration would never be propagated to the kernel due to us never
calling update_bpf_layer_cpumask(). Let's instead have the
initialization propagate changes to the skeleton before we attach the
scheduler.

This has the advantage both of fixing the bug mentioned above, where a
static configuration is never propagated to the kernel, and of avoiding
a short window right after the scheduler is attached during which we
make suboptimal scheduling decisions because the layer resizing hasn't
yet been propagated.

Signed-off-by: David Vernet <[email protected]>
In layered_init, we're currently setting all bits in every layer's
cpumask, and then asynchronously updating the cpumasks at a later time
to reflect their actual values at runtime. Now that we're updating the
layered code to initialize the cpumasks before we attach the scheduler,
we can instead have the init path actually refresh and initialize the
cpumasks directly.

Signed-off-by: David Vernet <[email protected]>
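
For illustration, here is a minimal sketch of what an init-path refresh along these lines could look like, reconstructed from the hunk and verifier log quoted later in this conversation; it would be invoked for each layer from the ops.init() path. The wrapper struct name, MAX_LAYERS/MAX_CPUS, and the layers[].cpus bitmap are assumptions, as is the scx common header providing bpf_for and the cpumask kfunc declarations. Only layer_cpumasks, cpumaskw, nr_possible_cpus, and the bpf_cpumask kfuncs appear in the actual log.

/*
 * Sketch only, not the literal patch. Names other than layer_cpumasks,
 * cpumaskw, nr_possible_cpus, and the bpf_cpumask kfuncs are assumed.
 */
#include <scx/common.bpf.h>

char _license[] SEC("license") = "GPL";

#define MAX_LAYERS	16		/* assumed bound */
#define MAX_CPUS	512		/* assumed bound */

const volatile u32 nr_possible_cpus;	/* filled in by user space before load */

/* Assumed per-layer state; user space writes the cpu bitmap before attach. */
struct layer {
	u8 cpus[MAX_CPUS / 8];
};

struct layer layers[MAX_LAYERS];

/* Each array entry wraps one layer's bpf_cpumask kptr. */
struct layer_cpumask_wrapper {
	struct bpf_cpumask __kptr *cpumask;
};

struct {
	__uint(type, BPF_MAP_TYPE_ARRAY);
	__uint(max_entries, MAX_LAYERS);
	__type(key, u32);
	__type(value, struct layer_cpumask_wrapper);
} layer_cpumasks SEC(".maps");

static void refresh_cpumasks(u32 idx)
{
	struct layer_cpumask_wrapper *cpumaskw;
	int cpu;

	if (idx >= MAX_LAYERS)
		return;

	cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);

	/*
	 * The bpf_cpumask kfuncs below require an RCU-protected pointer, so
	 * the cpumask kptr has to be dereferenced inside an RCU read-side
	 * section (see the review discussion further down).
	 */
	bpf_rcu_read_lock();
	bpf_for(cpu, 0, nr_possible_cpus) {
		u8 *u8_ptr = &layers[idx].cpus[cpu / 8];

		if (!cpumaskw || !cpumaskw->cpumask) {
			bpf_rcu_read_unlock();
			return;
		}

		if (*u8_ptr & (1 << (cpu % 8)))
			bpf_cpumask_set_cpu(cpu, cpumaskw->cpumask);
		else
			bpf_cpumask_clear_cpu(cpu, cpumaskw->cpumask);
	}
	bpf_rcu_read_unlock();
}
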
Currently, in layered_dispatch, we do the following:

1. Iterate over all preempt=true layers, and first try to consume from
   them.

2. Iterate over all confined layers, and consume from them if we find a
   layer with a cpumask that contains the consuming CPU.

3. Iterate over all grouped and open layers and consume from them in
   ordered sequence.

In (2), we're only iterating over confined layers, but we should also be
iterating over grouped layers. Otherwise, despite a consuming CPU being
allocated to a specific grouped layer, the CPU will consume from
whichever grouped or open layer has a task that's ready to run.

Signed-off-by: David Vernet <[email protected]>
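
To make step (2) concrete, below is a rough sketch of the shape of the dispatch-side fix. The helper lookup_layer_cpumask(), the LAYER_KIND_* constants, nr_layers, and using the layer index as its DSQ id are all assumptions for illustration, not necessarily the scheduler's actual names; steps (1) and (3) are elided.

/* Sketch of layered_dispatch with step (2) covering confined and grouped layers. */
void BPF_STRUCT_OPS(layered_dispatch, s32 cpu, struct task_struct *prev)
{
	int idx;

	/* (1) Try to consume from preempting (preempt=true) layers first (elided). */

	/*
	 * (2) Confined *and* grouped layers: only consume from a layer whose
	 * cpumask actually contains this CPU.
	 */
	bpf_for(idx, 0, nr_layers) {
		struct layer *layer = &layers[idx];
		struct cpumask *layer_cpumask;

		if (layer->kind == LAYER_KIND_OPEN)
			continue;

		if (!(layer_cpumask = lookup_layer_cpumask(idx)))
			continue;

		if (bpf_cpumask_test_cpu(cpu, layer_cpumask) &&
		    scx_bpf_consume(idx))
			return;
	}

	/* (3) Grouped and open layers in declared order (elided). */
}
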
Byte-Lab requested a review from htejun on February 21, 2024 at 21:40.
return;

cpumaskw = bpf_map_lookup_elem(&layer_cpumasks, &idx);

bpf_rcu_read_lock();
Contributor
What's this protecting?

Contributor Author (@Byte-Lab), Feb 21, 2024

Without the rcu region, the verifier doesn't like the cpumask kfunc arg:

; bpf_for(cpu, 0, nr_possible_cpus) {                                                                                                                                                    
441: (3e) if w1 >= w2 goto pc+23      ; frame1: R1_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R2_w=32 refs=9,28                                              
; if (!cpumaskw || !cpumaskw->cpumask) {                                                                                                                                                 
442: (15) if r6 == 0x0 goto pc+46     ; frame1: R6=map_value(off=0,ks=4,vs=8,imm=0) refs=9,28                                                                                            
; if (!cpumaskw || !cpumaskw->cpumask) {
443: (79) r2 = *(u64 *)(r6 +0)        ; frame1: R2_w=untrusted_ptr_or_null_bpf_cpumask(id=30,off=0,imm=0) R6=map_value(off=0,ks=4,vs=8,imm=0) refs=9,28
; if (!cpumaskw || !cpumaskw->cpumask) {
444: (15) if r2 == 0x0 goto pc+44     ; frame1: R2_w=untrusted_ptr_bpf_cpumask(off=0,imm=0) refs=9,28
; 
445: (bc) w3 = w1                     ; frame1: R1_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R3_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) refs=9,28
446: (74) w3 >>= 3                    ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=3,var_off=(0x0; 0x3)) refs=9,28
447: (bf) r4 = r7                     ; frame1: R4_w=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055280,var_off=(0x0; 0x183f78)) R7=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055280,var_off=(0x0; 0x183f78)) refs=9,28
448: (0f) r4 += r3                    ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=3,var_off=(0x0; 0x3)) R4_w=map_value(off=0,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055283,var_off=(0x0; 0x183f7b)) refs=9,28
449: (07) r4 += 527572                ; frame1: R4_w=map_value(off=527572,ks=4,vs=8476172,smin=smin32=0,smax=umax=smax32=umax32=1055283,var_off=(0x0; 0x183f7b)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
450: (bc) w3 = w1                     ; frame1: R1_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) R3_w=scalar(id=31,smin=smin32=0,smax=umax=smax32=umax32=31,var_off=(0x0; 0x1f)) refs=9,28
451: (54) w3 &= 7                     ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
452: (71) r4 = *(u8 *)(r4 +0)         ; frame1: R4_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=255,var_off=(0x0; 0xff)) refs=9,28
; if (*u8_ptr & (1 << (cpu % 8))) {
453: (7c) w4 >>= w3                   ; frame1: R3_w=scalar(smin=smin32=0,smax=umax=smax32=umax32=7,var_off=(0x0; 0x7)) R4_w=scalar() refs=9,28
454: (54) w4 &= 1                     ; frame1: R4=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=9,28
455: (16) if w4 == 0x0 goto pc+7      ; frame1: R4=scalar(smin=smin32=0,smax=umax=smax32=umax32=1,var_off=(0x0; 0x1)) refs=9,28
; bpf_cpumask_set_cpu(cpu, cpumaskw->cpumask); 
456: (85) call bpf_cpumask_set_cpu#36671
R2 must be a rcu pointer
processed 237 insns (limit 1000000) max_states_per_insn 1 total_states 24 peak_states 24 mark_read 9
-- END PROG LOAD LOG --

Otherwise, cpumaskw->cpumask could be freed out from under it on the ops.init() path. I'm kinda confused though why it's safe to pass cpumaskw->cpumask to the kfunc directly. I'd have expected us to need to put that into a local variable.
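
For illustration, the local-variable pattern alluded to here would look roughly like the fragment below. This is just a sketch using the names from the hunk above, not what the patch does (the patch passes cpumaskw->cpumask to the kfunc directly):

	struct bpf_cpumask *cpumask;

	bpf_rcu_read_lock();
	/* Snapshot the RCU-protected kptr into a local before the kfunc call. */
	cpumask = cpumaskw->cpumask;
	if (cpumask)
		bpf_cpumask_set_cpu(cpu, cpumask);
	bpf_rcu_read_unlock();
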

Contributor

I see. The existing code is lacking the needed protection then. Hmm... why was it even loading?

Contributor Author

I believe it's because it was previously only being called on paths with interrupts disabled. ops.init() is sleepable, so it explicitly needs the extra RCU protection.

Byte-Lab merged commit ebce76b into main on Feb 21, 2024 (1 check passed).
Byte-Lab deleted the layered_fixes branch on February 21, 2024 at 22:12.