
scx_layered: lighten/reduce nested loops in layered dispatch #746

Merged: 4 commits merged into sched-ext:main on Oct 8, 2024

Conversation

@likewhatevs (Contributor) commented Oct 7, 2024

I think (tentatively) this improves the responsiveness of layered by fixing what looks like an off-by-one error.

Interesting thing about the results below:
18.72 / 15.60 = 1.200 (main fork usr time / branch fork usr time)
146579 / 131561 = 1.114 (main fork bogo ops / branch fork bogo ops)
9.00 / 7.66 = 1.175 (main fork CPU used % per instance / branch fork CPU used % per instance)

I think that, in combination with the rest of the results, is saying:
This change reduces the absolute amount of work completed in this test (fewer fork bogo ops), but it reduces the user time and the CPU% spent on that work by a larger factor, so each op ends up cheaper.
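Spelling that out with the fork rows from the two outputs below (a throwaway per-op calculation, not part of the change):

```c
#include <stdio.h>

/* Per-op cost from the stress-ng fork rows quoted below.
 * main:   146579 bogo ops, 18.72 s usr, 9.00 CPU% per instance
 * branch: 131561 bogo ops, 15.60 s usr, 7.66 CPU% per instance
 */
int main(void)
{
	printf("usr s per op, main  : %.3e\n", 18.72 / 146579.0); /* ~1.28e-04 */
	printf("usr s per op, branch: %.3e\n", 15.60 / 131561.0); /* ~1.19e-04 */
	printf("CPU%% per op,  main  : %.3e\n",  9.00 / 146579.0); /* ~6.14e-05 */
	printf("CPU%% per op,  branch: %.3e\n",  7.66 / 131561.0); /* ~5.82e-05 */
	return 0;
}
```

Per-op usr time and CPU% both drop on the branch, even though the branch completes fewer ops in the fixed 14 s window.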

main

STRESS OUTPUT
stress-ng: info:  [295] setting to a 14 secs run per stressor
stress-ng: info:  [295] dispatching hogs: 32 cpu, 32 fork
stress-ng: info:  [295] note: /proc/sys/kernel/sched_autogroup_enabled is 1 and this can impact scheduling throughput for processes not attached to a tty. Setting this to 0 may improve performance metrics
stress-ng: metrc: [295] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [295]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [295] cpu              632537     14.00    279.98      0.62     45175.58        2254.20        62.63          6400
stress-ng: metrc: [295] fork             146579     14.00     18.72     21.60     10469.61        3635.57         9.00          2628
stress-ng: info:  [295] skipped: 0
stress-ng: info:  [295] passed: 63: cpu (31) fork (32)
stress-ng: info:  [295] failed: 0
stress-ng: info:  [295] metrics untrustworthy: 0
stress-ng: info:  [295] successful run completed in 14.02 secs
STRESS OUTPUT DONE
[   26.559795] ACPI: PM: Preparing to enter system sleep state S5
[   26.559945] kvm: exiting hardware virtualization
[   26.560172] reboot: Power down
[ble: elapsed 59.949s (CPU 554.7%)] ./local-e2e.sh scx_layered


branch

STRESS OUTPUT
stress-ng: info:  [296] setting to a 14 secs run per stressor
stress-ng: info:  [296] dispatching hogs: 32 cpu, 32 fork
stress-ng: info:  [296] note: /proc/sys/kernel/sched_autogroup_enabled is 1 and this can impact scheduling throughput for processes not attached to a tty. Setting this to 0 may improve performance metrics
stress-ng: metrc: [296] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [296]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [296] cpu              633144     14.00    279.68      0.96     45220.31        2256.05        62.64          6396
stress-ng: metrc: [296] fork             131561     14.00     15.60     18.72      9397.07        3833.80         7.66          2812
stress-ng: info:  [296] skipped: 0
stress-ng: info:  [296] passed: 63: cpu (31) fork (32)
stress-ng: info:  [296] failed: 0
stress-ng: info:  [296] metrics untrustworthy: 0
stress-ng: info:  [296] successful run completed in 14.02 secs
STRESS OUTPUT DONE
[   26.590950] ACPI: PM: Preparing to enter system sleep state S5
[   26.591101] kvm: exiting hardware virtualization
[   26.591221] reboot: Power down
[ble: elapsed 62.462s (CPU 528.8%)] ./local-e2e.sh scx_layered

@likewhatevs changed the title from "testing" to "debugging layered delays" Oct 7, 2024
@JakeHillion (Contributor) left a comment


Hmm I'm not convinced these were off by one... According to the docs (https://docs.ebpf.io/linux/concepts/loops/) you get an inclusive left-hand side and an exclusive right-hand side, which would mean these loops are correct. We do very similar loops in userspace from 0..nr_cpus and they don't have any issues.
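For reference, a minimal plain-C sketch of the half-open semantics described above; bpf_for(i, lo, hi) behaves the same way, and the nr_cpus value here is just illustrative:

```c
#include <stdio.h>

/* Inclusive lower bound, exclusive upper bound: looping to nr_cpus
 * already visits every CPU index 0 .. nr_cpus - 1, so stopping at
 * nr_cpus - 1 instead would be the actual off-by-one (it would skip
 * the last CPU).
 */
int main(void)
{
	unsigned int nr_cpus = 4; /* illustrative value */

	for (unsigned int cpu = 0; cpu < nr_cpus; cpu++)
		printf("visit cpu %u\n", cpu); /* prints 0, 1, 2, 3 */

	return 0;
}
```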

Have you tried running a longer performance test, or running a config with 16 layers? I think the reason this is working is that the layers array in BPF always has 16 elements, so it's fine to iterate further than necessary, and I'm really hoping the performance test is a bug because I can't explain this otherwise!

@likewhatevs (Contributor, Author)

Hmm I'm not convinced these were off by one...

This helps a lot, thanks. I'm not super familiar with this code, but if you're confident that isn't the issue, it's probably not the issue.

and I'm really hoping the performance test is a bug because I can't explain this otherwise!

I'm reasonably confident it isn't. Best I can tell, migrations have something to do with the performance issues with layered.

If there are no off-by-one errors, then if I understand the code correctly, I think I effectively short-circuited a bunch of logic in dispatch (which may be the problem)?

@likewhatevs (Contributor, Author)

This one barely changes the stress output, but the changes kind of make sense (and I'm curious to see how the histograms in CI look with this change).

This one flattens the loops in dispatch as much as can be done without changing the logic, and pulls as much logic out of that loop as possible (partly for the verifier/instruction limit, partly because the cost of that code would grow with bigger machines/more layers).
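Roughly, the shape of the change looks like the sketch below (plain C with hypothetical names, not the actual scx_layered dispatch code): drive the walk from a single counter and only redo the per-layer work when the layer index changes.

```c
#include <stdio.h>

#define NR_LAYERS 4
#define NR_CPUS   8

/* Stand-in for per-layer work that does not depend on the cpu
 * (hypothetical helper, purely for illustration). */
static int layer_weight(int layer)
{
	return layer + 1;
}

int main(void)
{
	int nested = 0, flat = 0;

	/* Nested form: the cpu-invariant call sits in the inner body,
	 * so it runs NR_LAYERS * NR_CPUS times. */
	for (int layer = 0; layer < NR_LAYERS; layer++)
		for (int cpu = 0; cpu < NR_CPUS; cpu++)
			nested += layer_weight(layer);

	/* Flattened form: one loop driven by a single counter; the
	 * per-layer work is only redone when the layer index changes. */
	int w = 0, prev_layer = -1;
	for (int i = 0; i < NR_LAYERS * NR_CPUS; i++) {
		int layer = i / NR_CPUS;
		if (layer != prev_layer) {
			w = layer_weight(layer);
			prev_layer = layer;
		}
		flat += w;
	}

	printf("nested=%d flat=%d\n", nested, flat); /* both print 80 */
	return 0;
}
```

In BPF the same idea also keeps the per-iteration body small, which is what helps with the verifier/instruction limit mentioned above.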

STRESS OUTPUT
stress-ng: info:  [295] setting to a 14 secs run per stressor
stress-ng: info:  [295] dispatching hogs: 32 cpu, 32 fork
stress-ng: info:  [295] note: /proc/sys/kernel/sched_autogroup_enabled is 1 and this can impact scheduling throughput for processes not attached to a tty. Setting this to 0 may improve performance metrics
stress-ng: metrc: [295] stressor       bogo ops real time  usr time  sys time   bogo ops/s     bogo ops/s CPU used per       RSS Max
stress-ng: metrc: [295]                           (secs)    (secs)    (secs)   (real time) (usr+sys time) instance (%)          (KB)
stress-ng: metrc: [295] cpu              631934     14.00    279.92      0.67     45132.62        2252.18        62.62          6184
stress-ng: metrc: [295] fork             148975     14.00     18.72     22.42     10639.52        3620.95         9.18          2620
stress-ng: info:  [295] skipped: 0
stress-ng: info:  [295] passed: 63: cpu (31) fork (32)
stress-ng: info:  [295] failed: 0
stress-ng: info:  [295] metrics untrustworthy: 0
stress-ng: info:  [295] successful run completed in 14.02 secs
STRESS OUTPUT DONE

@likewhatevs (Contributor, Author)

This branch is on the left; a random CI job on main is on the right:
[image: end-state histogram comparison]

Another random CI job:
[image: end-state histogram comparison]

It's hard to tell with all the variance between these jobs (there are reasonable examples in either direction), but I think I see this pulling the end-state histogram towards 16,32 being the peak (and if it's doing anything at all, that effect would be amplified by more layers/more LLCs).

@likewhatevs changed the title from "debugging layered delays" to "reduce nested loops in layered dispatch" Oct 8, 2024
@likewhatevs changed the title from "reduce nested loops in layered dispatch" to "scx_layered: lighten/reduce nested loops in layered dispatch" Oct 8, 2024
@likewhatevs added this pull request to the merge queue Oct 8, 2024
Merged via the queue into sched-ext:main with commit 2077b9a Oct 8, 2024
19 checks passed
@likewhatevs deleted the layered-delay branch October 8, 2024 11:34
likewhatevs added a commit to likewhatevs/scx that referenced this pull request Oct 9, 2024