scx_layered: lighten/reduce nested loops in layered dispatch #746
Hmm, I'm not convinced these were off by one. According to the docs (https://docs.ebpf.io/linux/concepts/loops/), the range has an inclusive left-hand side and an exclusive right-hand side, which would mean these loops are correct. We also do very similar loops in userspace over 0..nr_cpus, and they don't have any issues.
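To make the bounds argument concrete, here is a small plain-C sketch (not the scx_layered code; `visit_half_open` is a made-up helper for illustration) of a loop with an inclusive start and an exclusive end, which is the behaviour the linked docs describe and the same shape as a `0..nr_cpus` range in userspace:

```c
#include <stddef.h>

/*
 * Hypothetical helper: walk the half-open range [start, end) and record
 * each index visited. Returns how many indices were written to out.
 */
size_t visit_half_open(int start, int end, int *out, size_t out_cap)
{
	size_t n = 0;

	/* i == end is never visited, so a count can be used directly as the bound. */
	for (int i = start; i < end && n < out_cap; i++)
		out[n++] = i;

	return n;
}
```

With `start = 0` and `end = 4` this visits 0, 1, 2, 3 and stops, so a loop bounded by an element count does not step one past the last element.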
Have you tried running a longer performance test, or running a config with 16 layers? I think the reason this is working is that the layers array in BPF always has 16 elements, so it's fine to iterate further than is necessary. I'm really hoping the performance test result is a bug, because I can't explain this otherwise!
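A rough illustration of why iterating past the configured layer count is memory-safe when the array has a fixed size. This is a sketch only: `MAX_LAYERS`, `struct layer_stub` and `sum_tasks` are stand-ins chosen for the example, and the unused slots are assumed to be zero-initialised.

```c
#include <stddef.h>

#define MAX_LAYERS 16	/* assumed fixed array size, per the comment above */

struct layer_stub {	/* hypothetical stand-in for the real layer struct */
	int nr_tasks;
};

/*
 * Sum nr_tasks over the first `limit` slots. Any limit up to MAX_LAYERS
 * stays inside the array; larger limits are clamped. Slots past the
 * configured layer count are assumed to be zero and add nothing.
 */
int sum_tasks(const struct layer_stub layers[MAX_LAYERS], size_t limit)
{
	int total = 0;

	if (limit > MAX_LAYERS)
		limit = MAX_LAYERS;
	for (size_t i = 0; i < limit; i++)
		total += layers[i].nr_tasks;

	return total;
}
```

Scanning all 16 slots gives the same answer as scanning only the configured ones here, just with extra iterations, which is why a too-large bound would show up as wasted work rather than as a crash.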
This helps a lot, thanks. I'm not super familiar with this code, but if you're confident that isn't the issue, it's not likely the issue.
I'm reasonably confident it isn't. Best I can tell, migrations have something to do with the performance issues w/ layered. If there are no off-by-1 errors and IIUC the code, I think I effectively short-circuited a bunch of logic in dispatch (which may be the problem)?
This one barely changes stress output, but the changes kind of make sense (and I'm curious to see how the histograms in CI look w/ this change). It flattens the loops in dispatch as much as can be done w/o changing the logic, and also pulls as much logic out of those loops as possible (in part for the verifier/instruction limit, in part because the cost of that code would go up with bigger machines/more layers).
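As a generic sketch of the kind of hoisting described (not the actual dispatch code; `struct layer_stub`, `count_nested` and `count_hoisted` are hypothetical), a per-layer check that sits inside the per-CPU loop can be moved out so it runs once per layer instead of once per layer per CPU:

```c
#include <stdbool.h>
#include <stddef.h>

struct layer_stub {		/* hypothetical stand-in for a layer */
	bool enabled;
	unsigned long cpumask;	/* bit i set => CPU i allowed; nr_cpus must fit */
};

/* Before: the per-layer check is re-evaluated for every CPU. */
size_t count_nested(const struct layer_stub *layers, size_t nr_layers,
		    size_t nr_cpus)
{
	size_t hits = 0;

	for (size_t l = 0; l < nr_layers; l++) {
		for (size_t c = 0; c < nr_cpus; c++) {
			if (!layers[l].enabled)
				continue;
			if (layers[l].cpumask & (1UL << c))
				hits++;
		}
	}
	return hits;
}

/* After: the check is hoisted, so disabled layers skip the inner loop. */
size_t count_hoisted(const struct layer_stub *layers, size_t nr_layers,
		     size_t nr_cpus)
{
	size_t hits = 0;

	for (size_t l = 0; l < nr_layers; l++) {
		if (!layers[l].enabled)
			continue;
		for (size_t c = 0; c < nr_cpus; c++) {
			if (layers[l].cpumask & (1UL << c))
				hits++;
		}
	}
	return hits;
}
```

Both functions return the same count; the second does less work as the number of layers and CPUs grows, which is the scaling concern mentioned above.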
Force-pushed from 4924823 to e0ce471.
I think (like, maybe think) this improves the responsiveness of layered by fixing some off-by-1 error.
Interesting thing about the results below:
18.72 / 15.60 = 1.200 (main fork user sec / branch fork user sec)
146579 / 131561 = 1.114 (main bogo ops / branch bogo ops)
9 / 7.66 = 1.175 (main cpu used % per instance / branch cpu used % per instance)
I think that, in combination with the rest of the results, is saying:
This change reduces the absolute number of instructions executed for this test, and it also decreases how long those instructions take to execute (in terms of time) and how much CPU% those insns cost.
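For anyone who wants to re-check the arithmetic, the three ratios above can be reproduced with a few lines of C. The helper names `ratio` and `print_ratios` are made up for this sketch; the input numbers are the ones quoted above.

```c
#include <stdio.h>

/* Ratio of a metric measured on main to the same metric on the branch. */
double ratio(double main_val, double branch_val)
{
	return main_val / branch_val;
}

/* Print the three main/branch ratios using the figures quoted above. */
void print_ratios(void)
{
	printf("fork user sec: %.3f\n", ratio(18.72, 15.60));
	printf("bogo ops:      %.3f\n", ratio(146579.0, 131561.0));
	printf("cpu %% / inst:  %.3f\n", ratio(9.0, 7.66));
}
```

All three values come out above 1, meaning main is higher than the branch on each metric.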