
scx_layered: lighten/reduce nested loops in layered dispatch #746

Merged: 4 commits merged into sched-ext:main on Oct 8, 2024

Conversation

@likewhatevs (Contributor) commented Oct 7, 2024

I think (tentatively) this improves the responsiveness of layered by fixing what looks like an off-by-one error.

Interesting thing about the results below:
18.72 / 15.60 = 1.200 (main fork usr time / branch fork usr time)
146579 / 131561 = 1.114 (main fork bogo ops / branch fork bogo ops)
9.00 / 7.66 = 1.175 (main fork CPU used % per instance / branch fork CPU used % per instance)

I think that, in combination with the rest of the results, is saying:
This change reduces the absolute amount of work completed in this test (fewer fork bogo ops), but it reduces the user time and the CPU% spent on that work by a larger factor, so each op ends up cheaper.
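Spelling that out with the fork rows from the two outputs below (a throwaway per-op calculation, not part of the change):

```c
#include <stdio.h>

/* Per-op cost from the stress-ng fork rows quoted below.
 * main:   146579 bogo ops, 18.72 s usr, 9.00 CPU% per instance
 * branch: 131561 bogo ops, 15.60 s usr, 7.66 CPU% per instance
 */
int main(void)
{
	printf("usr s per op, main  : %.3e\n", 18.72 / 146579.0); /* ~1.28e-04 */
	printf("usr s per op, branch: %.3e\n", 15.60 / 131561.0); /* ~1.19e-04 */
	printf("CPU%% per op,  main  : %.3e\n",  9.00 / 146579.0); /* ~6.14e-05 */
	printf("CPU%% per op,  branch: %.3e\n",  7.66 / 131561.0); /* ~5.82e-05 */
	return 0;
}
```

Per-op usr time and CPU% both drop on the branch, even though the branch completes fewer ops in the fixed 14 s window.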

main

STRESS OUTPUT
stress-ng: info:  [295] setting to a 14 secs run per stressor
stress-ng: info:  [295] dispatching hogs: 32 cpu, 32 fork
stress-ng: info:  [295] note: /proc/sys/kernel/sched_autogroup_enabled is 1 and this can impact scheduling throughput for processes not attached to a tty. Setting this to 0 may improve performance metrics
stress-ng: metrc: [295] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [295]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [295] cpu              632537     14.00    279.98      0.62     45175.58        2254.20        62.63          6400
stress-ng: metrc: [295] fork             146579     14.00     18.72     21.60     10469.61        3635.57         9.00          2628
stress-ng: info:  [295] skipped: 0
stress-ng: info:  [295] passed: 63: cpu (31) fork (32)
stress-ng: info:  [295] failed: 0
stress-ng: info:  [295] metrics untrustworthy: 0
stress-ng: info:  [295] successful run completed in 14.02 secs
STRESS OUTPUT DONE
[   26.559795] ACPI: PM: Preparing to enter system sleep state S5
[   26.559945] kvm: exiting hardware virtualization
[   26.560172] reboot: Power down
[ble: elapsed 59.949s (CPU 554.7%)] ./local-e2e.sh scx_layered


branch

STRESS OUTPUT
stress-ng: info:  [296] setting to a 14 secs run per stressor
stress-ng: info:  [296] dispatching hogs: 32 cpu, 32 fork
stress-ng: info:  [296] note: /proc/sys/kernel/sched_autogroup_enabled is 1 and this can impact scheduling throughput for processes not attached to a tty. Setting this to 0 may improve performance metrics
stress-ng: metrc: [296] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [296]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [296] cpu              633144     14.00    279.68      0.96     45220.31        2256.05        62.64          6396
stress-ng: metrc: [296] fork             131561     14.00     15.60     18.72      9397.07        3833.80         7.66          2812
stress-ng: info:  [296] skipped: 0
stress-ng: info:  [296] passed: 63: cpu (31) fork (32)
stress-ng: info:  [296] failed: 0
stress-ng: info:  [296] metrics untrustworthy: 0
stress-ng: info:  [296] successful run completed in 14.02 secs
STRESS OUTPUT DONE
[   26.590950] ACPI: PM: Preparing to enter system sleep state S5
[   26.591101] kvm: exiting hardware virtualization
[   26.591221] reboot: Power down
[ble: elapsed 62.462s (CPU 528.8%)] ./local-e2e.sh scx_layered

@likewhatevs changed the title from "testing" to "debugging layered delays" Oct 7, 2024
@JakeHillion (Contributor) left a comment


Hmm I'm not convinced these were off by one... According to the docs (https://docs.ebpf.io/linux/concepts/loops/) you get an inclusive left-hand side and an exclusive right-hand side, which would mean these loops are correct. We do very similar loops in userspace from 0..nr_cpus and they don't have any issues.
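For reference, a minimal plain-C sketch of the half-open semantics described above; bpf_for(i, lo, hi) behaves the same way, and the nr_cpus value here is just illustrative:

```c
#include <stdio.h>

/* Inclusive lower bound, exclusive upper bound: looping to nr_cpus
 * already visits every CPU index 0 .. nr_cpus - 1, so stopping at
 * nr_cpus - 1 instead would be the actual off-by-one (it would skip
 * the last CPU).
 */
int main(void)
{
	unsigned int nr_cpus = 4; /* illustrative value */

	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
		printf("visit cpu %u\n", cpu); /* prints 0, 1, 2, 3 */

	return 0;
}
```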

Have you tried running a longer performance test, or running a config with 16 layers? I think the reason this is working is that the layers array in BPF always has 16 elements, so it's fine to iterate further than necessary, and I'm really hoping the performance test is a bug because I can't explain this otherwise!

@likewhatevs (Contributor, Author)

Hmm I'm not convinced these were off by one...

This helps a lot, thanks. I'm not super familiar with this code, but if you're confident that isn't the issue, it's probably not the issue.

and I'm really hoping the performance test is a bug because I can't explain this otherwise!

I'm reasonably confident it isn't. Best I can tell, migrations have something to do with the performance issues with layered.

If there are no off-by-one errors, then if I understand the code correctly, I think I effectively short-circuited a bunch of logic in dispatch (which may be the problem)?

@likewhatevs (Contributor, Author)

This one barely changes the stress output, but the changes kind of make sense (and I'm curious to see how the histograms in CI look with this change).

This one flattens the loops in dispatch as much as can be done without changing the logic, and pulls as much logic out of that loop as possible (partly for the verifier/instruction limit, partly because the cost of that code would grow with bigger machines/more layers).
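Roughly, the shape of the change looks like the sketch below (plain C with hypothetical names, not the actual scx_layered dispatch code): drive the walk from a single counter and only redo the per-layer work when the layer index changes.

```c
#include <stdio.h>

#define NR_LAYERS 4
#define NR_CPUS   8

/* Stand-in for per-layer work that does not depend on the cpu
 * (hypothetical helper, purely for illustration). */
static int layer_weight(int layer)
{
	return layer + 1;
}

int main(void)
{
	int nested = 0, flat = 0;

	/* Nested form: the cpu-invariant call sits in the inner body,
	 * so it runs NR_LAYERS * NR_CPUS times. */
	for (int layer = 0; layer < NR_LAYERS; layer++)
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			nested += layer_weight(layer);

	/* Flattened form: one loop driven by a single counter; the
	 * per-layer work is only redone when the layer index changes. */
	int w = 0, prev_layer = -1;
	for (int i = 0; i < NR_LAYERS * NR_CPUS; i++) {
		int layer = i / NR_CPUS;
		if (layer != prev_layer) {
			w = layer_weight(layer);
			prev_layer = layer;
		}
		flat += w;
	}

	printf("nested=%d flat=%d\n", nested, flat); /* both print 80 */
	return 0;
}
```

In BPF the same idea also keeps the per-iteration body small, which is what helps with the verifier/instruction limit mentioned above.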

STRESS OUTPUT
stress-ng: info:  [295] setting to a 14 secs run per stressor
stress-ng: info:  [295] dispatching hogs: 32 cpu, 32 fork
stress-ng: info:  [295] note: /proc/sys/kernel/sched_autogroup_enabled is 1 and this can impact scheduling throughput for processes not attached to a tty. Setting this to 0 may improve performance metrics
stress-ng: metrc: [295] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [295]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [295] cpu              631934     14.00    279.92      0.67     45132.62        2252.18        62.62          6184
stress-ng: metrc: [295] fork             148975     14.00     18.72     22.42     10639.52        3620.95         9.18          2620
stress-ng: info:  [295] skipped: 0
stress-ng: info:  [295] passed: 63: cpu (31) fork (32)
stress-ng: info:  [295] failed: 0
stress-ng: info:  [295] metrics untrustworthy: 0
stress-ng: info:  [295] successful run completed in 14.02 secs
STRESS OUTPUT DONE

@likewhatevs (Contributor, Author)

This branch is on the left; a random CI job on main is on the right:
[image: end-state histogram comparison]

Another random CI job:
[image: end-state histogram comparison]

It's hard to tell with all the variance between these jobs (there are reasonable examples in either direction), but I think I see this pulling the end-state histogram towards 16,32 being the peak (and if it's doing anything at all, that effect would be amplified by more layers/more LLCs).

@likewhatevs changed the title from "debugging layered delays" to "reduce nested loops in layered dispatch" Oct 8, 2024
@likewhatevs changed the title from "reduce nested loops in layered dispatch" to "scx_layered: lighten/reduce nested loops in layered dispatch" Oct 8, 2024
@likewhatevs added this pull request to the merge queue Oct 8, 2024
Merged via the queue into sched-ext:main with commit 2077b9a Oct 8, 2024
19 checks passed
@likewhatevs deleted the layered-delay branch October 8, 2024 11:34
likewhatevs added a commit to likewhatevs/scx that referenced this pull request Oct 9, 2024