Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parallelize remove_dir_all #4172

Closed
SUPERCILEX opened this issue Oct 14, 2021 · 12 comments
Closed

Parallelize remove_dir_all #4172

SUPERCILEX opened this issue Oct 14, 2021 · 12 comments
Labels
A-tokio Area: The main tokio crate M-fs Module: tokio/fs

Comments

@SUPERCILEX
Copy link
Contributor

The current implementation simply forwards to the stdlib, but this is the one fs operation that can actually take advantage of concurrency. As long as files within the same directory aren't deleted in parallel, the kernel shouldn't be locking anything and you'll get a full speedup.

@Darksonn Darksonn added A-tokio Area: The main tokio crate M-fs Module: tokio/fs labels Oct 15, 2021
@Darksonn
Copy link
Contributor

Well, I'm not convinced that a parallel remove_dir_all belongs in Tokio.

@SUPERCILEX
Copy link
Contributor Author

Why not?

@Darksonn
Copy link
Contributor

I mean, for one, there are no other methods in Tokio like this. More importantly, I really don't feel like maintaining an implementation of remove_dir_all considering that it turns out to be rather difficult. (see e.g. rust-lang/rust#29497)

@SUPERCILEX
Copy link
Contributor Author

Oh wow, that actually explains some bugs in other projects I've worked on! I wouldn't mind putting a cfg(not) for windows so only unix gets the faster version.

Here's my initial attempt:

async fn fast_remove_dir_all(path: &Path) -> io::Result<()> {
    let path = path.to_path_buf();
    let path = tokio::task::spawn_blocking(|| -> io::Result<Option<PathBuf>> {
        let filetype = fs::symlink_metadata(&path)?.file_type();
        if filetype.is_symlink() {
            fs::remove_file(&path)?;
            Ok(None)
        } else {
            Ok(Some(path))
        }
    }).await??;

    match path {
        None => Ok(()),
        Some(path) => remove_dir_all_recursive(path).await,
    }
}

async fn remove_dir_all_recursive(path: PathBuf) -> io::Result<()> {
    let path_copy = path.clone();
    let tasks = tokio::task::spawn_blocking(move || -> io::Result<_> {
        let mut tasks = Vec::new();

        for child in fs::read_dir(&path)? {
            let child = child?;
            if child.file_type()?.is_dir() {
                tasks.push(spawn_remove_dir_all_recursive(&child.path()));
            } else {
                fs::remove_file(&child.path())?;
            }
        }

        Ok(tasks)
    }).await??;

    for result in futures::future::join_all(tasks).await {
        result??;
    }

    tokio::task::spawn_blocking(|| {
        fs::remove_dir(path_copy)
    }).await??;

    Ok(())
}

fn spawn_remove_dir_all_recursive(path: &Path) -> JoinHandle<io::Result<()>> {
    tokio::task::spawn(remove_dir_all_recursive(path.to_path_buf()))
}

It's kind of ugly IMO, but I've seen it be 4X faster than straight remove_dir_all:

2.2 million files

remove_dir_all
real    0m21.697s
user    0m1.501s
sys     0m19.107s

fast_remove_dir_all
real    0m8.022s
user    0m4.016s
sys     1m2.381s

---

1.3 million files

remove_dir_all
real    0m12.210s
user    0m0.844s
sys     0m11.179s

fast_remove_dir_all
real    0m3.509s
user    0m2.302s
sys     0m32.536s

---

12K files

remove_dir_all
real    0m0.251s
user    0m0.127s
sys     0m0.124s

fast_remove_dir_all
real    0m0.184s
user    0m0.205s
sys     0m0.285s

Dunno if you have ideas for improvements? I'm new to tokio, so I'm guessing there's some more efficient way of doing things that I don't know about.

PS: I put together this crate to be able to reproducibly create these large file trees. Here were my params:

$ ftzz g ../tmp -n 2M -r 1000
$ ftzz g ../tmp -n 1M
$ ftzz g ../tmp -n 10K

@SUPERCILEX
Copy link
Contributor Author

Slightly faster and prettier version:

async fn fast_remove_dir_all(path: &Path) -> io::Result<()> {
    let path = path.to_path_buf();
    let path = tokio::task::spawn_blocking(|| -> io::Result<_> {
        let filetype = symlink_metadata(&path)?.file_type();
        if filetype.is_symlink() {
            remove_file(&path)?;
            Ok(None)
        } else {
            Ok(Some(path))
        }
    })
        .await??;

    match path {
        None => Ok(()),
        Some(path) => spawn_remove_dir_all_recursive(path).await?,
    }
}

async fn remove_dir_all_recursive(path: PathBuf) -> io::Result<()> {
    let mut tasks = Vec::new();

    for child in read_dir(&path)? {
        let child = child?;
        if child.file_type()?.is_dir() {
            tasks.push(spawn_remove_dir_all_recursive(child.path()));
        } else {
            remove_file(&child.path())?;
        }
    }

    for task in tasks {
        task.await??;
    }

    remove_dir(path)
}

#[inline]
fn spawn_remove_dir_all_recursive(path: PathBuf) -> JoinHandle<io::Result<()>> {
    tokio::task::spawn_blocking(|| {
        futures::executor::block_on(remove_dir_all_recursive(path))
    })
}

@SUPERCILEX
Copy link
Contributor Author

@Darksonn what are the next steps? Is this something that could go into tokio (assuming an alternative to futures::executor::block_on is found)? If not, would it be ok to have the tokio::fs::remove_dir_all method link to a crate that implements this?

@Darksonn
Copy link
Contributor

I would be ok with adding a link to the docs, however the implementation you have posted here is incorrect. Imagine the following situation:

We are deleting a directory with 1000 sub-directories. Each sub-directory has another empty sub-directory. The remove_dir_all_recursive function gets called on the top-level directory and spawns 1000 sub-tasks, one for each directory. Then, around 500 of those tasks are started (the rest can't start due to the spawn_blocking threadpool size limit), and each of those 500 spawn another sub-task each, but those sub-tasks are also unable to run due to the spawn_blocking threadpool size limit. This is a deadlock since none of the 500 tasks can ever quit because they are waiting for another task to start, but those other tasks cannot start because the 500 tasks are there.

@Darksonn
Copy link
Contributor

As for block_on, you can use a Handle for it.

@Darksonn
Copy link
Contributor

Darksonn commented Nov 2, 2021

I'm going to close this. We are not currently interested in the feature because a correct implementation is too difficult, and it should be prototyped in an external crate first. I would be happy to add a link to such an external crate if you write one.

@Darksonn Darksonn closed this as completed Nov 2, 2021
@SUPERCILEX
Copy link
Contributor Author

Sounds good! This is still on my plate, but I probably won't get to it for a few weeks.

@rbtcollins
Copy link

I will note the remove_dir_all crate exists specifically to deal with the complexities documented here, and it has a parallel implementation for Windows already. https://github.com/XAMPPRocky/remove_dir_all

SUPERCILEX added a commit to SUPERCILEX/fuc that referenced this issue Oct 9, 2022
This finally beats the shitty implementation I posted here: tokio-rs/tokio#4172 (comment)

Signed-off-by: Alex Saveau <[email protected]>
@SUPERCILEX
Copy link
Contributor Author

Finally got around to this if anyone was wondering (haven't officially released yet though): https://github.com/SUPERCILEX/fuc/tree/master/rmz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-fs Module: tokio/fs
Projects
None yet
Development

No branches or pull requests

3 participants