
Deadlock? when using par_bridge() #690

Closed · gralpli opened this issue Sep 5, 2019 · 11 comments

@gralpli commented Sep 5, 2019

I'm using rayon 1.2.0 on Windows 10 with rustc 1.36.0 stable. I'm not sure how best to report this bug, and I don't even know whether it is a bug in rayon at all, but I thought it would be better to have someone look at it. I can provide additional information on request.

I've written the following code:

use jwalk::WalkDir;
use rayon::prelude::*; // needed for par_bridge()

let entries = WalkDir::new(path)
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();

I'm using jwalk 0.4.0; it uses rayon internally.

If I run this code on a big folder (C:\Users\Home or C:\) it sometimes hangs indefinitely in par_bridge.rs. It tries to acquire the lock on line 165, but always takes the Err(TryLockError::WouldBlock) match arm.

@cuviper (Member) commented Sep 5, 2019

> I'm using jwalk 0.4.0; it uses rayon internally.

par_bridge() uses a mutex to share the sequential iterator. If that in turn uses rayon, I think you'll hit issue #592. You could use WalkDir::num_threads(1) to avoid rayon internally.

I wonder if jwalk::WalkDir could implement IntoParallelIterator natively?
cc @jessegrosjean
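
For illustration, a minimal sketch of that workaround, assuming the jwalk 0.4 builder API and the path variable from the original snippet (not code taken from this thread):

use jwalk::WalkDir;
use rayon::prelude::*;

// Restrict jwalk to a single internal thread so the walk itself does not
// depend on rayon while par_bridge() holds its internal mutex.
let entries = WalkDir::new(path)
    .num_threads(1)
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();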

@jessegrosjean commented Sep 5, 2019

> I wonder if jwalk::WalkDir could implement IntoParallelIterator natively?
> cc @jessegrosjean

I'm open to suggestions... though I'm not sure I understand exactly how I would do that. Implementing ParallelIterator looks a bit scary to me :)

Behind the scenes jwalk::WalkDir uses rayon to process the walk. Reading all entries in a directory is one unit of work, so you only get parallelism when you are reading multiple folders (i.e. a recursive directory tree).

If you want to do expensive processing on each directory entry, then you probably want per-entry parallelism instead of per-directory. I think the best use of jwalk is to let it do its thing and return entries as fast as possible, and then, if you need more parallelism, apply it after you already have those entries. So maybe change the above code to something like:

let entries = WalkDir::new(path).into_iter().collect::<Vec<_>>();
let paths = entries
    .into_par_iter()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();

@pkolaczk commented:

I can confirm I'm hitting this problem on Linux as well.
As a workaround I'm using a channel + par_bridge() on the receiver side; then it doesn't hang. However, this adds complexity and possibly increases memory use, because many entries need to be buffered until the receiver starts processing them.

A directly working par_bridge() would be much better.
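
For reference, a minimal sketch of that channel workaround, with assumed names and not the commenter's exact code (path is assumed to be an owned PathBuf); note that later comments in this thread found a separate rayon pool for jwalk is also needed to fully avoid the deadlock:

use std::sync::mpsc::channel;
use std::thread;

use jwalk::WalkDir;
use rayon::prelude::*;

let (tx, rx) = channel();

// Produce entries on a plain OS thread so par_bridge() only ever sees a channel.
thread::spawn(move || {
    WalkDir::new(path).into_iter().for_each(|entry| {
        // Ignore send errors: they only mean the receiver was dropped.
        let _ = tx.send(entry);
    });
});

// Consume the receiver in parallel on the rayon pool.
let names: Vec<String> = rx
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect();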

@jessegrosjean commented:

It's not ideal, but I think you can avoid the lockup now by adding:

.parallelism(Parallelism::RayonNewPool(0))

to the jwalk builder.
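
For illustration, that builder call in the context of the original snippet, assuming a jwalk version that provides Parallelism::RayonNewPool (a sketch, not tested code from this thread):

use jwalk::{Parallelism, WalkDir};
use rayon::prelude::*;

let entries = WalkDir::new(path)
    // jwalk walks the tree on its own private rayon pool, so it no longer
    // competes with the pool that drives par_bridge().
    .parallelism(Parallelism::RayonNewPool(0))
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();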

@cuviper (Member) commented Apr 22, 2020

If RayonNewPool does what it sounds like, shifting the jwalk work to a private pool, beware that having one pool block on another still triggers that first pool to attempt work stealing while it waits. I'm not sure whether that will actually cause problems with par_bridge(), but I feel wary.

I may have mentioned this before, but I think we probably need some kind of "critical section" primitive in rayon-core to let a thread block without work-stealing. We would use this in par_bridge() when holding the internal mutex, and others might use this too for cases like #592.

@pkolaczk commented:

Yes, I also found that two separate pools are mandatory: even with an unbounded channel, the receiver can block, all rayon threads could get stuck on the receiving end, and there would be no threads left for jwalk, so it would deadlock forever.

Unfortunately, two pools are not good from the perspective of system performance, because they add context switching between the producer and consumer sides. I noticed a big bounded channel (> 64k items) helps performance at the expense of memory use.

I would still like to be able to do all of that in a single rayon pool, but this looks surprisingly complex.

@pkolaczk commented Apr 24, 2020

My current solution:

use std::path::PathBuf;
use std::sync::Arc;
use std::sync::mpsc::sync_channel;
use std::thread;

use jwalk::{Parallelism, WalkDir};
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// WalkOpts is the caller's own options struct (not shown).
pub fn walk_dirs(paths: Vec<PathBuf>, opts: WalkOpts) -> impl ParallelIterator<Item=PathBuf> {

    let (tx, rx) = sync_channel(65536);

    // We need to use a separate rayon thread-pool for walking the directories, because
    // otherwise we may get deadlocks caused by blocking on the channel.
    let thread_pool = Arc::new(
        ThreadPoolBuilder::new()
            .num_threads(opts.parallelism)
            .build()
            .unwrap());

    for path in paths {
        let tx = tx.clone();
        let thread_pool = thread_pool.clone();
        thread::spawn(move || {
            WalkDir::new(&path)
                .skip_hidden(opts.skip_hidden)
                .follow_links(opts.follow_links)
                .parallelism(Parallelism::RayonExistingPool(thread_pool))
                .into_iter()
                .for_each(move |entry| match entry {
                    Ok(e) if e.file_type.is_file() || e.file_type.is_symlink() =>
                        tx.send(e.path()).unwrap(),
                    Ok(_) =>
                        (),
                    Err(e) =>
                        eprintln!("Cannot access path {}: {}", path.display(), e)
                });
        });
    }

    rx.into_iter().par_bridge()
}
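
A hypothetical call site for walk_dirs above; WalkOpts is the caller's own options struct, and its field names here are only assumed from the snippet:

// Collect all regular files under /data, filtering in parallel.
let opts = WalkOpts { parallelism: 4, skip_hidden: true, follow_links: false };
let files: Vec<PathBuf> = walk_dirs(vec![PathBuf::from("/data")], opts)
    .filter(|path| path.extension().is_some())
    .collect();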

@untitaker commented Oct 24, 2020

Just hit this issue as well. Perhaps naive, but the issue does go away if I switch par_bridge to a kind of reentrant mutex (#811), which I think should be safe. Edit: ignore me, I don't think reentrancy is safe here.

@cuviper (Member) commented Dec 8, 2022

I hope #997 will fix this, but I would appreciate folks here testing that.

@untitaker commented:

I can confirm this issue is fixed after upgrading to rayon-core 0.11 and jwalk 0.7. Upgrading both is necessary.

I didn't test the older rayon-core where this was supposedly fixed (0.10.2).

@cuviper (Member) commented Jun 25, 2023

Thanks for confirming!

cuviper closed this as completed Jun 25, 2023