readDir performance improvements (1/2) #809
Conversation
This is a significant performance boost for high-latency filesystems.
Looks great. I tested locally with both PRs merged and am seeing the speed-up.
Our docs state node 8 as the minimum, so for fs.promises to work this will need to be increased to 10. I don't know how big a shock this will be to users; Ubuntu 18.04 still defaults to node 8.
Throwing a loud error about needing node 10 might be enough.
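For illustration, a loud startup check could be sketched roughly like this (the wording and placement are assumptions, not the validator's actual code):

```js
// Minimal sketch of a startup guard; message wording and placement are illustrative.
const [major, minor] = process.versions.node.split('.').map(Number)

if (major < 10 || (major === 10 && minor < 10)) {
  console.error(
    `bids-validator requires Node.js >= 10.10 (found ${process.versions.node}); ` +
      'fs.promises and withFileTypes are not available on older releases.',
  )
  process.exit(1)
}
```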
It also requires at least v10.10 to expose the withFileTypes optimization. I've updated the engine restriction in package.json so that npm/yarn will throw an error if you try to update to a version of the validator without being on a new enough node release.
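The engines field in package.json would look roughly like this (the exact version string is an assumption):

```json
{
  "engines": {
    "node": ">=10.10.0"
  }
}
```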
Looks like CircleCI might be using a cache with an old Node.js version in it. I don't have permissions to trigger a build without the cache.
This reworks utils.files.readDir to avoid the extra syscalls found in #807 and greatly improve performance for network filesystems.
The main improvement is using readdir(dir, { withFileTypes: true }) which allows us to skip the extra stat call for each file in the dataset since the fs.Dirent objects returned include the file type. On Linux + NFS this is a pretty big improvement since it results in one getdents64 call for each directory instead of a readdir call for each directory and a stat call for every file and directory.
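A minimal sketch of the pattern, assuming a hypothetical listFiles helper rather than the validator's actual implementation:

```js
const fs = require('fs').promises
const path = require('path')

// Recursively list files using Dirent objects, so no per-entry stat() is needed.
// Requires Node >= 10.10 for { withFileTypes: true }.
async function listFiles(dir) {
  // One readdir per directory returns names and file types together
  // (a single getdents64 on Linux instead of readdir + stat per entry).
  const entries = await fs.readdir(dir, { withFileTypes: true })
  const files = []
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name)
    if (entry.isDirectory()) {
      files.push(...(await listFiles(fullPath)))
    } else {
      files.push(fullPath)
    }
  }
  return files
}
```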
The second improvement is avoiding traversal of ignored files entirely. If a directory is ignored, we used to scan the entire thing anyway and just prune it at the end. This prevents a lot of extra scans of the .git tree for DataLad datasets.
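Pruning at traversal time rather than afterwards can be sketched as follows; the isIgnored predicate is a stand-in for the validator's real ignore logic (e.g. .bidsignore rules), and the function names are hypothetical:

```js
const fs = require('fs').promises
const path = require('path')

// Stand-in predicate; the real validator applies its own ignore rules.
const isIgnored = relativePath => relativePath.split(path.sep).includes('.git')

async function listFilesPruned(root, dir = root) {
  const entries = await fs.readdir(dir, { withFileTypes: true })
  const files = []
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name)
    const relativePath = path.relative(root, fullPath)
    // Skip ignored paths up front instead of scanning them and pruning at the end.
    if (isIgnored(relativePath)) continue
    if (entry.isDirectory()) {
      files.push(...(await listFilesPruned(root, fullPath)))
    } else {
      files.push(fullPath)
    }
  }
  return files
}
```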
Some benchmarks - test suite on my workstation (ext4 + SSD):
Validating a real dataset with 105k files on AWS (NFSv4 client + EFS):