
Deadlock? when using par_bridge() #690

Closed · gralpli opened this issue Sep 5, 2019 · 11 comments

@gralpli commented Sep 5, 2019

I'm using rayon 1.2.0 on Windows 10 with rustc 1.36.0 stable. I'm not sure how best to report this bug, and I don't even know whether it is a bug in rayon at all, but I thought it would be better to have someone look at it. I can provide additional information on request.

I've written the following code:

use jwalk::WalkDir;
use rayon::prelude::*; // needed for par_bridge()

let entries = WalkDir::new(path)
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();

I'm using jwalk 0.4.0; it uses rayon internally.

If I run this code on a big folder (C:\Users\Home or C:\) it sometimes hangs indefinitely in par_bridge.rs. It tries to acquire the lock on line 165, but always takes the Err(TryLockError::WouldBlock) match arm.

@cuviper (Member) commented Sep 5, 2019

> I'm using jwalk 0.4.0; it uses rayon internally.

par_bridge() uses a mutex to share the sequential iterator. If that in turn uses rayon, I think you'll hit issue #592. You could use WalkDir::num_threads(1) to avoid rayon internally.

I wonder if jwalk::WalkDir could implement IntoParallelIterator natively?
cc @jessegrosjean
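
For illustration, a minimal sketch of that workaround, assuming the jwalk 0.4 builder API and the path variable from the original snippet (not code taken from this thread):

use jwalk::WalkDir;
use rayon::prelude::*;

// Restrict jwalk to a single internal thread so the walk itself does not
// depend on rayon while par_bridge() holds its internal mutex.
let entries = WalkDir::new(path)
    .num_threads(1)
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();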

@jessegrosjean commented Sep 5, 2019

> I wonder if jwalk::WalkDir could implement IntoParallelIterator natively?
> cc @jessegrosjean

I'm open to suggestions... though I'm not sure I understand exactly how I would do that. Implementing ParallelIterator looks a bit scary to me :)

Behind the scenes jwalk::WalkDir uses rayon to process the walk. Reading all entries in a directory is one unit of work, so you only get parallelism when you are reading multiple folders (i.e. a recursive directory tree).

If you want to do expensive processing on each directory entry, then you probably want per-entry parallelism instead of per-directory. I think the best use of jwalk is to let it do its thing and return entries as fast as possible, and then, if you need more parallelism, apply it after you already have those entries. So maybe change the above code to something like:

let entries = WalkDir::new(path).into_iter().collect::<Vec<_>>();
let paths = entries
    .into_par_iter()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();

@pkolaczk commented:

I can confirm I'm hitting this problem on Linux as well.
As a workaround I'm using a channel + par_bridge() on the receiver side; then it doesn't hang. However, this adds complexity and possibly increases memory use, because many entries need to be buffered until the receiver starts processing them.

A directly working par_bridge() would be much better.
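
For reference, a minimal sketch of that channel workaround, with assumed names and not the commenter's exact code (path is assumed to be an owned PathBuf); note that later comments in this thread found a separate rayon pool for jwalk is also needed to fully avoid the deadlock:

use std::sync::mpsc::channel;
use std::thread;

use jwalk::WalkDir;
use rayon::prelude::*;

let (tx, rx) = channel();

// Produce entries on a plain OS thread so par_bridge() only ever sees a channel.
thread::spawn(move || {
    WalkDir::new(path).into_iter().for_each(|entry| {
        // Ignore send errors: they only mean the receiver was dropped.
        let _ = tx.send(entry);
    });
});

// Consume the receiver in parallel on the rayon pool.
let names: Vec<String> = rx
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect();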

@jessegrosjean commented:

It's not ideal, but I think you can avoid the lockup now by adding:

.parallelism(Parallelism::RayonNewPool(0))

to the jwalk builder.
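
For illustration, that builder call in the context of the original snippet, assuming a jwalk version that provides Parallelism::RayonNewPool (a sketch, not tested code from this thread):

use jwalk::{Parallelism, WalkDir};
use rayon::prelude::*;

let entries = WalkDir::new(path)
    // jwalk walks the tree on its own private rayon pool, so it no longer
    // competes with the pool that drives par_bridge().
    .parallelism(Parallelism::RayonNewPool(0))
    .into_iter()
    .par_bridge()
    .map(|entry| entry.unwrap().path().display().to_string())
    .collect::<Vec<_>>();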

@cuviper (Member) commented Apr 22, 2020

If RayonNewPool does what it sounds like, shifting the jwalk work to a private pool, beware that having one pool block on another still triggers that first pool to attempt work stealing while it waits. I'm not sure whether that will actually cause problems with par_bridge(), but I feel wary.

I may have mentioned this before, but I think we probably need some kind of "critical section" primitive in rayon-core to let a thread block without work-stealing. We would use this in par_bridge() when holding the internal mutex, and others might use this too for cases like #592.

@pkolaczk commented:

Yes, I also found that two separate pools are mandatory: even with an unbounded channel, the receiver can block, all rayon threads could get stuck on the receiving end, and there would be no threads left for jwalk, so it would deadlock forever.

Unfortunately, two pools are not good from the perspective of system performance, because they add context switching between the producer and consumer sides. I noticed a big bounded channel (> 64k items) helps performance at the expense of memory use.

I would still like to be able to do all of that in a single rayon pool, but this looks surprisingly complex.

@pkolaczk commented Apr 24, 2020

My current solution:

use std::path::PathBuf;
use std::sync::Arc;
use std::sync::mpsc::sync_channel;
use std::thread;

use jwalk::{Parallelism, WalkDir};
use rayon::prelude::*;
use rayon::ThreadPoolBuilder;

// WalkOpts is the caller's own options struct (not shown).
pub fn walk_dirs(paths: Vec<PathBuf>, opts: WalkOpts) -> impl ParallelIterator<Item=PathBuf> {

    let (tx, rx) = sync_channel(65536);

    // We need to use a separate rayon thread-pool for walking the directories, because
    // otherwise we may get deadlocks caused by blocking on the channel.
    let thread_pool = Arc::new(
        ThreadPoolBuilder::new()
            .num_threads(opts.parallelism)
            .build()
            .unwrap());

    for path in paths {
        let tx = tx.clone();
        let thread_pool = thread_pool.clone();
        thread::spawn(move || {
            WalkDir::new(&path)
                .skip_hidden(opts.skip_hidden)
                .follow_links(opts.follow_links)
                .parallelism(Parallelism::RayonExistingPool(thread_pool))
                .into_iter()
                .for_each(move |entry| match entry {
                    Ok(e) if e.file_type.is_file() || e.file_type.is_symlink() =>
                        tx.send(e.path()).unwrap(),
                    Ok(_) =>
                        (),
                    Err(e) =>
                        eprintln!("Cannot access path {}: {}", path.display(), e)
                });
        });
    }

    rx.into_iter().par_bridge()
}
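
A hypothetical call site for walk_dirs above; WalkOpts is the caller's own options struct, and its field names here are only assumed from the snippet:

// Collect all regular files under /data, filtering in parallel.
let opts = WalkOpts { parallelism: 4, skip_hidden: true, follow_links: false };
let files: Vec<PathBuf> = walk_dirs(vec![PathBuf::from("/data")], opts)
    .filter(|path| path.extension().is_some())
    .collect();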

@untitaker commented Oct 24, 2020

Just hit this issue as well. Perhaps naive, but the issue does go away if I switch par_bridge to a kind of reentrant mutex (#811), which I think should be safe. Edit: ignore me, I don't think reentrancy is safe here.

@cuviper (Member) commented Dec 8, 2022

I hope #997 will fix this, but I would appreciate folks here testing that.

@untitaker commented:

I can confirm this issue is fixed after upgrading to rayon-core 0.11 and jwalk 0.7. Upgrading both is necessary.

I didn't test the older rayon-core where this was supposedly fixed (0.10.2).

@cuviper (Member) commented Jun 25, 2023

Thanks for confirming!

cuviper closed this as completed Jun 25, 2023