deadlock/hang when nesting jwalk within par_iter() #967

Closed · Byron opened this issue Aug 16, 2022 · 12 comments

Byron commented Aug 16, 2022

To start, here is a minimal example to reproduce the issue: https://github.com/Byron/reproduce-rayon-hang.

To reproduce, you can run the following (link to code):

git clone https://github.com/Byron/reproduce-rayon-hang
cd reproduce-rayon-hang
cargo run -- break

I could work around it by setting the pool size higher than the number of inputs, which is what happens when invoking cargo run above (without the -- break argument).

Even though I'm CCing @jessegrosjean as the author of jwalk, I hope that bringing it up here might help resolve why this interaction between jwalk and rayon fails despite jwalk implementing public rayon traits.

A related issue seems to be #690, which involves par_bridge() instead of par_iter(). Even without par_bridge() the issue persists.

The reason I looked into this was the hang in starship/starship#4251. I think it occurred because the affected users ran on computers with 4 cores, while starship wanted to use par_iter() on 5 inputs, one of which caused jwalk to run within the par_iter() as part of gitoxide's git_repository::discover().
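For illustration, here is a minimal sketch of that pattern (hypothetical inputs, not starship's actual code): with fewer worker threads than inputs, every worker can end up inside the par_iter() closure, leaving none free for the tasks jwalk schedules on the same global pool.

use jwalk::WalkDir;
use rayon::prelude::*;

fn main() {
    // Hypothetical stand-ins for the five inputs mentioned above.
    let inputs = vec![".", "src", "target", "tests", "benches"];

    // Force a pool smaller than the number of inputs, as on a 4-core machine.
    rayon::ThreadPoolBuilder::new()
        .num_threads(4)
        .build_global()
        .unwrap();

    inputs.par_iter().for_each(|path| {
        // One of these walks plays the role of git_repository::discover():
        // jwalk schedules its directory reads on the already saturated pool.
        for _entry in WalkDir::new(*path) {}
    });

    eprintln!("done"); // not reached while the hang reproduces
}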

It would definitely be valuable if we could find and fix the root cause of this, as I fear the issue might pop up more often: by default gitoxide uses jwalk, and if any user of gitoxide happens to open a repository within a par_iter() context, this hang could occur for the portion of end-users who don't have enough cores or whose workload is higher.

Thanks for your help.

@davidkna

I think in starship the issue might also be that all available threads attempt to wait on the central OnceCell<context::Repo>, blocking all workers that jwalk attempts to spawn.
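A sketch of that shape (a constructed example assuming the once_cell crate, not starship's actual code): many workers block in get_or_init() while the single initializer itself would need free rayon workers.

use once_cell::sync::OnceCell;
use rayon::prelude::*;

// Stand-in for starship's central OnceCell<context::Repo>.
static REPO: OnceCell<String> = OnceCell::new();

fn discover_repo() -> String {
    // Imagine this running jwalk / git_repository::discover() on the
    // global rayon pool, which the blocked callers below already occupy.
    String::from("repo")
}

fn main() {
    (0..8).into_par_iter().for_each(|_| {
        // Only one caller runs the initializer; the rest block here,
        // parked non-cooperatively on the cell's internal lock.
        let _repo = REPO.get_or_init(discover_repo);
    });
}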

cuviper (Member) commented Aug 17, 2022

If code is blocking a rayon thread in a non-cooperative way, I don't know what rayon could do about it.

For example, with RAYON_NUM_THREADS=1, this spins forever:

use jwalk::WalkDir;

fn main() {
    rayon::scope(|_| {
        eprintln!("WalkDir…");
        for _entry in WalkDir::new(".") {}
        eprintln!("WalkDir completed");
    });
}
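
For illustration, here is a minimal sketch of such a non-cooperative block without jwalk involved (a constructed example, not code from this thread): the closure blocks on a channel whose sender is a rayon task that can never be scheduled on the fully occupied pool.

use std::sync::mpsc;

fn main() {
    // A pool with a single worker, like running with RAYON_NUM_THREADS=1.
    let pool = rayon::ThreadPoolBuilder::new()
        .num_threads(1)
        .build()
        .unwrap();

    pool.install(|| {
        let (tx, rx) = mpsc::channel();
        // This job is queued on the same pool, but its only worker is the
        // one currently blocked below, so the job can never run.
        rayon::spawn(move || tx.send(42).unwrap());
        // Blocking on the channel does not cooperate with rayon's
        // scheduler (no work-stealing happens while parked): deadlock.
        let value: i32 = rx.recv().unwrap();
        eprintln!("never printed: {value}");
    });
}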

Byron (Author) commented Aug 17, 2022

[Screenshot from 2022-08-15 16:30: stack trace of the hanging run]

What I found interesting is that jwalk seems to continuously traverse the directory without ever finishing, like it can't make progress (in my example).

In the example right above, it's the same stacktrace.

[Screenshot from 2022-08-17 08:27: the same stack trace, captured while running the example above]

This makes me hopeful that the cause is some interaction with the Bridge that jwalk implements, one that is subtly incorrect in these scenarios and can be fixed one way or another.

I'd be surprised if jwalk were able to continuously make up work for itself to keep blocking the one thread there is; maybe @jessegrosjean could chime in for clarification?

Or in other words, if jwalk were really doing that, I'd expect it to happen without the outer rayon::scope(…) as well, yet that case works fine even with only one thread. So at the very least there must be some unhappy interaction between the two that should be fixable one way or another.
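
For contrast, a sketch of the case just described, the same traversal without the enclosing scope, which (per the observation above) completes even with RAYON_NUM_THREADS=1:

use jwalk::WalkDir;

fn main() {
    // No enclosing rayon::scope: the main thread drives the iterator from
    // outside the pool, so the single worker stays free for jwalk's tasks.
    for _entry in WalkDir::new(".") {}
    eprintln!("WalkDir completed");
}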

cuviper (Member) commented Dec 8, 2022

#997 should help with par_bridge deadlocking itself, but it looks like jwalk still goes wild here...

@jessegrosjean

Sorry for not responding earlier; I've been using just Swift for the last year and haven't been using this code.

@cuviper Thanks for Rayon 1.6.1; it fixes the problem I was running into that made the custom jwalk_par_bridge code necessary. I've now removed that code from jwalk's main branch. That's good because it removes an unknown in jwalk, but as you note, jwalk still exhibits the bug reported in this thread.

@Byron I have taken a quick look and can reproduce the problem you describe, but I don't have any ideas for a solution yet. You would be amazed at how quickly jwalk, rayon, and Rust all fall out of my understanding. I will try to load it all back in sometime next week and see if I have any ideas. Also, if you feel you could make more progress without me in the way, I'm happy to transfer jwalk to you for maintenance. I'm encouraged to see the various projects that you've built on it.

Byron (Author) commented Dec 12, 2022

@jessegrosjean I am excited that you could reproduce the problem, and that you were able to reduce the surface that could cause this issue.

Also, if you feel you could make more progress without me in the way, I'm happy to transfer jwalk to you for maintenance. I'm encouraged to see the various projects that you've built on it.

The killer feature really is its compatibility with walkdir and the built-in parallel sorting. Indeed, it has been very important for my projects, and being able to maintain it myself would be an honour I gladly accept. That should give me the push to finally understand how its machinery works as well; the entire rayon interaction eludes me 😅. pdu has a parallel recursive implementation as well, but supports neither sorting nor iteration. I wonder if it would be possible to implement jwalk without rayon entirely, to ease maintenance.

jessegrosjean commented Dec 12, 2022

I wonder if it would be possible to implement jwalk without rayon entirely, to ease maintenance.

@Byron I can't say anything for sure, but I don't think so. jwalk was originally inspired by ignore, which does implement the parallelism on its own. In my mind that part of the code gets messy, and I think the end result is slower than what rayon provides.

I think I kinda understand the problem now, though I don't have a fix yet. Also, if you were to build your own solution, you would have to create your own thread pool. All these problems are related to the fact that by default jwalk tries to share rayon's global thread pool. If instead jwalk creates its own thread pool, then the problems go away. So there's always that outlet option if we need it (much easier than reimplementing parallelism)... but it would also be nice for jwalk to always work when sharing the global rayon thread pool.
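
For reference, a sketch of that outlet as it looks from the caller's side; the Parallelism variant name here is an assumption based on jwalk's builder API and may differ between jwalk versions, so check the docs of the release you use.

use jwalk::{Parallelism, WalkDir};
use rayon::prelude::*;

fn main() {
    let inputs = vec![".", "src", "target"];

    inputs.par_iter().for_each(|path| {
        // Assumed API: ask jwalk for its own small rayon pool instead of
        // sharing the (possibly saturated) global one.
        let walk = WalkDir::new(*path).parallelism(Parallelism::RayonNewPool(2));
        for _entry in walk {}
    });
}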

Maybe we could do a Zoom meeting sometime and I can explain any bits of the code you have questions on (maybe I can anyway :))? I'm https://mastodon.social/@jessegrosjean or [email protected].

Byron (Author) commented Dec 12, 2022

That's great news! And indeed, thread-pool management is something that is probably a bit undervalued by many, as they don't actually notice it exists, even though they would if it didn't 😅. Knowing that having its own thread pool can also resolve the issue is great as well; it's something gitoxide could then manage somehow. If rayon threads in ThreadPools would automatically shut down after a while, it should be simple and good enough, even though auto-shutdown doesn't seem to be present, or I failed to find it in the docs.

…but it would also be nice for jwalk to always work when sharing the global rayon thread pool.

Maybe over time the issue will be solved here in rayon, so waiting is a possible course of action. I am happy to do that as well.

Maybe we could do a Zoom meeting sometime and I can explain any bits of the code you have questions on (maybe I can anyway :))?

That's a fantastic idea 🎉🙏. Maybe it could double as some sort of hand-over, and I'd be looking forward to meeting the fellow Rustacean upon whom a lot of my software depends :D. I will send you an email with a Calendly link so you can find a time that works for you.

@jessegrosjean

Maybe over time the issue will be solved here in rayon, so waiting is a possible course of action. I am happy to do that as well.

Maybe... though I think it's much more likely that the problem is on jwalk's end :)

Byron (Author) commented Dec 15, 2022

Coincidentally, I have found another issue in jwalk that could cause the iterator to block even though it would eventually finish. That issue is certainly unrelated, even though I had hoped it would help here.

This leaves us with a reproducible busy hang when running the crash.rs program (thanks @jessegrosjean for adding it).

Citing @cuviper, followed by @jessegrosjean:

If code is blocking a rayon thread in a non-cooperative way, I don't know what rayon could do about it.

Maybe... though I think it's much more likely that the problem is on jwalk's end :)

I have a hunch that the busy hang is something jwalk is causing itself, and that a preferred way of dealing with it would be to detect whether such an issue could occur and abort with an error before we grind to a halt, or to actually fix it somehow.

Thus I think it's fair to continue the quest on jwalk's side. Thanks to everyone here for their support; it's much appreciated 🙏.

Byron closed this as not planned on Dec 15, 2022.

Byron (Author) commented Dec 15, 2022

And I could validate that it is indeed jwalk that enters a yield loop, so I presume it's the logic here that should somehow detect that no progress can be made.

What's interesting is that the closure that produces results is never invoked, so I think I have a handle on fixing this now 🎉. After all, the thread pool is busy executing a task that tries to spawn another task it depends on, but it cannot finish unless it gets results from that task, which can never be spawned. A clear dependency loop that has to be broken on jwalk's side.

Byron (Author) commented Dec 15, 2022

And it's done: jwalk can now detect deadlocks and fail gracefully.
