[scx_lavd] Getting large pauses and stutters in-game #234
Comments
Did you verify the stutters don't happen when the scheduler isn't loaded? I've been having very similar issues with all of the CachyOS kernels (CachyOS/linux-cachyos#244). I'm using the cachyos-eevdf kernel, for instance, and have very similar issues. |
I didn't have any issues with the cachyos kernel when I took a baseline, though it was a short test.
Given how often the issue was occurring with scx_lavd, I would have expected to notice if it was also happening on the default cachyos kernel and scheduler.
I can try playing for longer on the default cachyos kernel and scheduler to fully rule that out though.
|
Same thing, heavy load scenarios time out the scheduler and it just unloads, after which program continues running without a problem. |
I tested compiling the rust program and playing the game at the same time, and got an scx_lavd error. log file : scx_lavd_dump-2024-04-20-1.txt edit : update log file. |
Thank you for reporting the issues. This will be really helpful for improving and tuning LAVD. The preemption logic is under development and will be improved in the next cycle.
@ALL -- When collecting the log, if you can run
@ChrisLane -- Thank you for opening the issue and sharing the log!
@Cifer9516 -- If you don't mind, could you share the log and the game scenario that you played?
@skygrango -- If you don't mind, could you share the game scenario that you played? It seems that you ran an old version of scx_lavd. If possible, could you try the latest version (git HEAD)? |
here's a dump log of |
ran another test (commit 9a9b4d) and attached the full log here, but apologies if it's a bit hard to read since i use oh-my-zsh styling in my terminal. |
I think i'm facing the same issue here. Stuttering in game and then LAVD crashes. |
Thanks @bunkbail and @Galcian79 for sharing the logs. |
So, this suggests that 2869 has been sitting on the local DSQ for over 40s which should only be possible if either the previous task keeps running or tasks keep getting queued ahead of it. The latter happens only with explicit Now, a gaming thread being runnable constantly wouldn't be too surprising (maybe it's busy polling for some event?) but what should have happened is it running down its slice and then taking turns with other runnable tasks, which doesn't seem to have happened. Hmm... so, one peculiarity of
So, if I'm reading the code right, if the task keeps waking up the RT thread before its slice expires, the task can get its time slice refreshed indefinitely. If this is the case, the remedy may be: a. Distinguish the first I of course could be completely wrong, so please take it with a pinch of salt. From sched_ext core side, I've been thinking about adding the ability for BPF scheduler implementation to add scheduler specific information to the debug dump. Hopefully, that should make debugging from user reports easier. |
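To make the hypothesis above concrete, here is a minimal toy simulation, not sched_ext code. It assumes a task that wakes an RT thread at a fixed interval of its own runtime, gets preempted, and is then put back on the CPU. The function name, the 5 ms slice, and the 1 ms wake interval are all assumptions chosen for illustration; only the 40 s horizon comes from the log discussed above.

```python
from typing import Optional


def cpu_time_before_slice_expires(
    slice_ns: int,
    wake_interval_ns: int,
    refresh_on_running: bool,
    horizon_ns: int,
) -> Optional[int]:
    """Toy model: return the CPU time the task consumes before its slice
    runs out, or None if the slice never expires within horizon_ns."""
    remaining = slice_ns
    consumed = 0
    while consumed < horizon_ns:
        # The task runs until it wakes the RT thread or exhausts its slice.
        ran = min(remaining, wake_interval_ns)
        consumed += ran
        remaining -= ran
        if remaining == 0:
            return consumed  # slice expired, other tasks get a turn
        # The RT thread preempts the task; later the task runs again.
        if refresh_on_running:
            remaining = slice_ns  # hypothesised behaviour: full refill
    return None


SLICE = 5_000_000        # assumed 5 ms slice
WAKE = 1_000_000         # assumed: RT thread woken every 1 ms of runtime
HORIZON = 40_000_000_000  # 40 s, as in the stall seen in the log

starved = cpu_time_before_slice_expires(SLICE, WAKE, True, HORIZON)
fair = cpu_time_before_slice_expires(SLICE, WAKE, False, HORIZON)
print("refill on every run:", starved)
print("keep remaining slice:", fair)
```

Under these assumptions, refilling the slice each time the task starts running means it never expires within the horizon, while carrying over the remaining slice bounds the task to one slice of CPU time.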
I was wondering if it could be helpful to have a way to easily migrate all tasks to the SCHED_EXT class, even RT/DL tasks (maybe excluding kthreads). At some point I had a similar issue with rustland and RT audio tasks, where audio was performing better (less audio cracks) by moving the RT tasks to the SCHED_EXT class. I did that from user-space (via schedtool), but having a way directly in sched-ext to "disable" the other sched classes (even if it sounds a bit too extreme) could be useful, especially for debugging, since we could reliably take out of the equation RT tasks. Opinions? |
after I compile the latest git version, scx_lavd no longer crashes. but when I run compilation with 100% cpu usage in the background, the game fps is still very unstable and low. |
Yeah, maybe. It's not technically difficult to implement but it's also kinda nice to have the guarantee that RT tasks will always run no matter what happens to scx schedulers. Lemme think over it. Thanks. |
I synced to git too. The issue is still there i think, but far less noticeable. Doesn't crash anymore. |
Wouldn't it be useful to have some tool like scx_lavd --version? |
If you use the “git” version, you can always check what you are using. For example, in Arch Linux:
you can very easily indicate which commit I built from and then you know what version you have at your place (in this case, for the moment in which I publish a comment - the latest version). Your idea is not bad, although I see one drawback. Suppose we give version “0.1.0” - and now the question is, how would it be bumped up? What release? Every commit/pull request? I say in advance this is not a criticism just a request to develop the idea. |
I see now. So all the schedulers are tied to the same version. |
I mean your idea is not bad. Only you would have to work out how to update the version. However, checking what commit you are based on is also a good solution, because then you know globally what version you are using. |
I mean usually you would want something like x.x.x-dev. This is usually what binaries compiled from git report. But ofc concerning the -dev part you would always refer to the git commit. |
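As a rough sketch of the x.x.x-dev idea, the snippet below combines a base version with a short commit hash. The function name and the `-dev+<commit>` layout are assumptions for illustration only and do not describe how scx_lavd reports its version; "0.1.0" and "973aded" are simply values mentioned in this thread.

```python
import re


def dev_version(base: str, commit: str) -> str:
    """Build a hypothetical development version string such as
    '0.1.0-dev+973aded' from a base version and a git commit hash."""
    if not re.fullmatch(r"\d+\.\d+\.\d+", base):
        raise ValueError(f"base version must look like x.x.x, got {base!r}")
    if not re.fullmatch(r"[0-9a-f]{7,40}", commit):
        raise ValueError(f"commit must be a 7-40 character hex hash, got {commit!r}")
    return f"{base}-dev+{commit}"


print(dev_version("0.1.0", "973aded"))
```

With a layout like this the base version would only need bumping at releases, while the commit suffix identifies the exact build.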
Edited my previous comments, adding the commit hash version of the build. Here's a new dump log of the latest build (commit 973aded), now it took longer to crash lavd (around 5 mins compared to previous version where it took just around 1 min to crash). Game was still lagging and freezing throughout. |
@htejun @arighi After looking at the logs, the watchdog timeout happened only when a task was in a local DSQ. It seems your hypothesis is correct -- an RT task keeps preempting an scx task, so the scx task's time slice is replenished on every preemption, letting it run indefinitely and causing the watchdog timeout error. Here are my two suggestions:
What do you think? |
@multics69 I need to look better at the code (maybe @htejun or @Byte-Lab already have a better answer to this), but I think it should be possible already to replenish only the remaining time slice, implementing Maybe another approach to attack the problem could be to have multiple SCHED_EXT classes (following what Joel suggested here https://lore.kernel.org/all/CAEXW_YR02g=DetfwM98ZoveWEbGbGGfb1KAikcBeC=Pkvqf4OA@mail.gmail.com/). Having multiple SCHED_EXT classes would allow to have more options to decide which classes of tasks a sched-ext scheduler can grab, and eventually provide cmdline options to grab all tasks, or exclude RT tasks, etc. But I'm just brainstorming here... this is something to think about more. |
@arighi -- Yes, restoring the remaining time slice is possible using cpu_release/acquire() callbacks. However, if this could be a problem in all schedulers, I think it would be better to implement it at the framework level. Again, @htejun and @Byte-Lab are the right people to answer this. hehe :-) Supporting multiple sched_ext classes sounds like an interesting direction. I will also think about it more. |
I don't think
Other than the immediate
|
Hmmm.... @arighi did mention excluding RT kthreads. That probably doesn't break any kernel mechanisms. I'll think more on it. |
Thank you @htejun for the detailed explanation. Setting the time slice at ops.running() seems to be the source of the problem, as you explained. Regarding the scx iterator, I don't think the iterator helps much here. That is because the global DSQ is already a vtime-based DSQ. And, in lavd, the vtime of the DSQ is the task's deadline, so scx_bpf_dispatch() should always return the task with the earliest deadline. Or should we set the task's time slice at ops.dispatch() using the scx iterator? Regarding refresh_slice, I thought a transition such as ops.runnable() -> ops.running() -> ops.stopping() -> ops.running() ... is possible even without RT task preemption. I thought it could happen when a task's time slice is exhausted but the task is still in a runnable state. Am I correct? |
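The ordering argument above can be illustrated with a toy priority queue keyed by deadline. This is only a Python model of "a queue ordered by vtime, where vtime is the task's deadline"; the class and method names are hypothetical and are not part of sched_ext or scx_lavd.

```python
import heapq
import itertools
from typing import List, Tuple


class ToyDeadlineQueue:
    """Toy stand-in for a vtime-ordered queue whose key is a deadline."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, str]] = []
        # Sequence numbers keep insertion order among equal deadlines.
        self._seq = itertools.count()

    def insert(self, task: str, deadline: int) -> None:
        heapq.heappush(self._heap, (deadline, next(self._seq), task))

    def pop_earliest(self) -> str:
        # The smallest key (earliest deadline) always comes out first.
        _deadline, _seq, task = heapq.heappop(self._heap)
        return task


queue = ToyDeadlineQueue()
queue.insert("render", 30)
queue.insert("audio", 10)
queue.insert("compile", 20)
order = [queue.pop_earliest() for _ in range(3)]
print(order)
```

Because the queue itself already yields tasks in deadline order, walking it with an iterator would not change which task is picked next, which is the point made in the comment.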
Yes, that's what I meant.
You're right. I guess that's why you put slice setting in |
Or rather, how about |
Oops, that won't work right now. Kernel will trigger zero slice warning. |
Another update from me. I've built lavd from the latest commit (d9ea53c) and also cherry picked 2 patches from PR #247 and PR #250. Now, lavd no longer crashes and the game no longer stutters/freezes like before. In my limited test, everything is smooth as butter. Awesome work @multics69 and @htejun! |
Apex Legends is F2P, so I could give it a try too. |
If you test it, make sure to enable the shader messages It's already enough to work in the Firing Range, no need to play online multiplayer. |
Even with GPL? |
@Galcian79 if you mean this, it's actually as stated in the docs. Shader compilation starts as soon as the mode (map) is selected in the menu, but it takes a while and keeps running during the game phase. If the caching is done, the game runs butter smooth but with a slight difference when enabling |
I will add here that CS2 main menu is like 10fps with lavd, runs fine without |
I started the game with fps unlocked, all graphic settings set to max. I played on the training ground waiting for dxvk to compile the shaders. Didn't notice any stutter. |
I don't know what your specs are, but I run Arch with an NVIDIA GPU |
That's why i included my inxi. |
It doesn't show your specs... |
Which specs are we talking about? |
Fixed my op. |
Hey folks, I ran a test again now to attach my inxi.log + scx_lavd-01-05-2024.log Some facts:
Hope this helps, otherwise just ping me. |
Here is the log of another game. I just found logs cannot run for more than 3hrs. |
Thanks @DasLeo and @Galcian79 for narrowing down the problem. |
FYI, please test on the 6.9 linux-cachyos-rc kernel, since LAVD requires cpufreq support which is not included in the cachyos 6.8 Kernel currently. |
I wanted to test the latest lavd but I only have kernel 6.8.9 and latest lavd refuses to launch there. I tried reverting commit a24e1d7 (reverting commit 6892898 causes merge conflict) and build, but it still refuses to launch where pretty much other scx schedulers are running just fine. I'll wait until 6.9 hits stable then I'll be able to test lavd. |
@Galcian79 Thank you for sharing your log. Did you notice any stuttering? Is the problem getting worse or better? |
I haven't noticed any stuttering. |
Tried to build Cachy 6.9-rc7 (I guess) yesterday but I got BTF errors in the final vmlinux step, need to figure this out to start further testing. |
I was testing for a while today on 6.7-rc7 and hadn't had any issues nor micro-stutter. Great work folks! 👍👍👍 Here are my logs if you still wanna have a look: Will use the |
Same here. I'm on the stable 6.9 kernel, no micro-stutter during gaming and desktop responsiveness under load is also good. Great work @multics69! |
Compiled at fa1c146 and tested with the same setup but 6.9.0-rc7-5-cachyos-rc and didn't experience any of the severe stuttering that I was having before 🎉 Closing this issue. Thank you for the help! |
I read about scx_lavd on Phoronix and decided to give it a try and get some basic benchmarks.
I have observed that I am able to quite reliably reproduce stutters and pauses, sometimes several seconds long while running around my ship in Helldivers 2.
Here's the output from
sudo scx_lavd
while I was running the game, with several stutters in the period (not sure of the exact timestamps): https://gist.github.com/ChrisLane/1945f66b5d8a2f36a530ca9ac8abcfa1
Please let me know if I can do anything to debug or provide more useful info for your debugging purposes.
System:
Distro: Arch Linux
Kernel: 6.8.7-1-cachyos
GPU: AMD Radeon RX 6800 XT (RADV NAVI21)
CPU: AMD Ryzen 9 5900X 12-Core Processor
RAM: 36GB