deadlock/hang when nesting `jwalk` within `par_iter()` #967
I think in starship, the issue might also be that all available threads attempt to wait on the central

If code is blocking a rayon thread in a non-cooperative way, I don't know what rayon could do about it. For example, with
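As a hypothetical illustration (not taken from any of the crates discussed here), non-cooperative blocking of rayon workers can look like the sketch below: every worker parks in a plain `recv()`, and rayon has no opportunity to steal work from a thread that is stuck in an OS-level wait.

```rust
use std::sync::{mpsc, Mutex};

use rayon::prelude::*;

fn main() {
    // `_tx` is kept alive so `recv()` never returns with an error.
    let (_tx, rx) = mpsc::channel::<i32>();
    let rx = Mutex::new(rx);

    // Each closure parks in `recv()`, an OS-level wait that rayon cannot
    // interrupt or work-steal around. Once every worker is parked here,
    // nothing is left to make progress: this program never terminates.
    (0..rayon::current_num_threads())
        .into_par_iter()
        .for_each(|_| {
            let _ = rx.lock().unwrap().recv();
        });
}
```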
What I found interesting is that in the example right above, it's the same stacktrace. This makes me hopeful that it's some interaction with the Bridge that I'd be surprised if. Or in other words, if
#997 should help with
Sorry for not responding earlier; I've been using just Swift for the last year and haven't been using this code.

@cuviper Thanks for Rayon 1.6.1; it fixes the problem that I was running into that caused me to need the custom `jwalk_par_bridge` code. I've now removed that from `jwalk` main. That's good because it removes an unknown in `jwalk`, but as you note `jwalk` is still having a problem with the bug reported in this thread.

@Byron I have taken a quick look and can reproduce the problem you describe, but don't have any ideas on a solution yet. You would be amazed at how quickly `jwalk`, `rayon`, and Rust all fall out of my understanding. I will try to load it all back in sometime next week and see if I have any ideas. Also, if you feel like you could make more progress without me in the way, I'm happy to transfer `jwalk` to you for maintenance. I'm encouraged to see the various projects that you've built on it.
@jessegrosjean I am excited that you could reproduce the problem, and that you were able to reduce the surface that could cause this issue.
The killer-feature really is its compatibility with
@Byron I can't say anything for sure, but I don't think so. `jwalk` was originally inspired by `ignore`, which does implement the parallelism on its own. In my mind that part of the code gets messy, and I think the end result is slower than what rayon provides. I think I kinda understand the problem now, though I don't have a fix yet. Also, if you were to build your own solution you would have to create your own thread pool.

All these problems are related to the fact that by default `jwalk` tries to share rayon's global thread pool. If instead `jwalk` creates its own thread pool then the problems go away. So there's always that outlet option if we need it (much easier than reimplementing parallelism)... but it would also be nice for `jwalk` to always work when sharing the global rayon thread pool.

Maybe we could do a Zoom meeting sometime and I can explain any bits of the code you have questions on (maybe I can anyway :))? I'm https://mastodon.social/@jessegrosjean or [email protected].
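As a rough sketch of the "own thread pool" outlet described above, here is what the pattern can look like using plain rayon APIs (this is not `jwalk`'s actual configuration surface, just an illustration): the nested work is handed to a dedicated pool, so the outer `par_iter()` on the global pool cannot starve it.

```rust
use rayon::prelude::*;

fn main() {
    // A dedicated pool for the "inner" (jwalk-like) work. Because it is
    // separate from the global pool, the outer `par_iter()` cannot starve
    // it: even if every global worker is blocked waiting on inner results,
    // the inner tasks still have threads of their own to run on.
    let inner_pool = rayon::ThreadPoolBuilder::new()
        .num_threads(2)
        .build()
        .expect("failed to build dedicated pool");

    let inputs = vec!["a", "bb", "ccc", "dddd", "eeeee"];

    let results: Vec<u64> = inputs
        .par_iter() // runs on the global pool
        .map(|name| {
            // Hand nested parallel work to the dedicated pool instead of
            // recursively relying on the global one.
            inner_pool.install(|| (0..name.len() as u64).into_par_iter().sum())
        })
        .collect();

    println!("{results:?}");
}
```

The trade-off is that `install` still blocks the calling global worker while the inner pool runs, but because the inner tasks have their own threads, they can always make progress and the block eventually resolves.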
That's great news! And indeed, thread-pool management is something that is probably a bit undervalued by many as they don't actually notice it exists, even though they would if it didn't 😅. Knowing that having its own thread-pool can also resolve the issue is great as well; it's something
Maybe over time the issue will be solved here in
That's a fantastic idea 🎉🙏. Maybe it could be some sort of hand-over all-in-one, and I'd be looking forward to meeting the fellow Rustacean upon whom a lot of my software depends :D. I will send you an email with a calendly link so you can find a time that works for you.
Maybe... though I think it's much more likely that that problem is on jwalk's end :)
Coincidentally I have found another issue in

This leaves us with a reproducible busy hang when running the `crash.rs` program (thanks @jessegrosjean for adding it). Citing @cuviper, followed by @jessegrosjean:
I have a hunch that the busy hang is something that

Thus I think it's fair to continue the quest on
And I could validate that it is indeed

What's interesting is that the closure that produces results is never invoked, so I think I have a handle on fixing this now 🎉. After all, the threadpool is busy executing the task that tries to spawn another that it depends on, but it cannot finish unless it gets results from a task that cannot be spawned. A clear dependency loop that has to be broken on
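The dependency loop described above can be reduced to a tiny, self-contained sketch (hypothetical, not the actual `jwalk`/`rayon` internals). The program below hangs by design: the only worker blocks waiting on a task it just queued onto its own pool.

```rust
use std::sync::mpsc;

fn main() {
    // One worker makes the loop easy to see; with more threads you only
    // need enough blocked outer tasks to occupy the whole pool.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .unwrap();

    pool.install(|| {
        let (tx, rx) = mpsc::channel();

        // The task below is queued onto the same (already busy) pool...
        rayon::spawn(move || {
            tx.send(42).ok();
        });

        // ...while the only worker blocks here waiting for its result.
        // The spawned task can never be scheduled: a dependency loop,
        // so this program hangs.
        let answer: i32 = rx.recv().unwrap();
        println!("{answer}");
    });
}
```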
And it's done:
To start, here is a minimal example to reproduce the issue: https://github.com/Byron/reproduce-rayon-hang .
To reproduce, you can run the following (link to code):
I could work around it by setting the pool size higher than the size of all inputs, which happens when invoking `cargo run` above (without the `-- break` argument).

Despite me CCing @jessegrosjean as the author of `jwalk`, I hope bringing it up here might help resolve why this interaction between `jwalk` and `rayon` fails despite `jwalk` implementing public `rayon` traits.

A related issue seems to be #690, which involves a `par_bridge()` instead of `par_iter()`. Even without par-bridge the issue persists.

The reason for me looking into this was the hang in starship/starship#4251. I think the reason was that these users ran on computers with 4 cores, while `starship` would want to use `par_iter()` on 5 inputs, with one of these causing `jwalk` to run within the `par_iter()` as part of `gitoxide`'s `git_repository::discover()`.

It would definitely be valuable if we could find and fix the root cause of this as I fear this issue might pop up more often - by default, `gitoxide` uses `jwalk`, and if any user of `gitoxide` happens to open a repository within a `par_iter()` context, this hang could occur for the portion of end-users who don't have enough cores or whose workload is higher.

Thanks for your help.
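Since the reproduction code itself isn't inlined here, the following is only an illustration of the shape of the problem and not the code from the linked repository: the `walk_like` function is a made-up stand-in for a jwalk-style traversal, and the program shows an outer `par_iter()` over more inputs than worker threads, where each item blocks on results that must be produced by tasks spawned onto the same global pool.

```rust
use std::sync::mpsc;

use rayon::prelude::*;

// A made-up stand-in for a jwalk-style traversal: results are produced by
// tasks spawned onto the global rayon pool and consumed through a channel.
fn walk_like(root: &str) -> usize {
    let (tx, rx) = mpsc::channel();
    for depth in 0..4 {
        let tx = tx.clone();
        rayon::spawn(move || {
            // pretend to read one directory level
            tx.send(depth).ok();
        });
    }
    drop(tx);
    // This blocks the current worker until the spawned tasks have run. If
    // every worker in the pool is already parked here, nobody is left to
    // run them: the hang described in this issue.
    root.len() + rx.iter().sum::<usize>()
}

fn main() {
    // With the default global pool (one thread per core) and more inputs
    // than cores, every worker can end up inside `walk_like`, blocked in
    // `recv`, so this can hang on a 4-core machine. The workaround from
    // above is to size the global pool larger than the number of inputs,
    // e.g. rayon::ThreadPoolBuilder::new().num_threads(6).build_global()
    // before any other rayon use.
    let inputs = vec!["a", "bb", "ccc", "dddd", "eeeee"];
    let total: usize = inputs.par_iter().map(|p| walk_like(p)).sum();
    println!("{total}");
}
```

Whether it actually wedges depends on scheduling, which matches the original report: the hang shows up only for users whose core count is small relative to the number of inputs.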