readDir performance improvements (1/2) #809
Conversation
This is a significant performance boost for high-latency filesystems.
Looks great. I tested locally with both PRs merged and am seeing the speed-up.
Our docs state node 8 as the minimum, so for fs.promises to work this will need to be increased to 10. I don't know how big a shock this will be to users; Ubuntu 18.04 still defaults to node 8.
Throwing a loud error about needing node 10 might be enough.
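For illustration, a loud startup check could be sketched roughly like this (the wording and placement are assumptions, not the validator's actual code):

```js
// Minimal sketch of a startup guard; message wording and placement are illustrative.
const [major, minor] = process.versions.node.split('.').map(Number)

if (major < 10 || (major === 10 && minor < 10)) {
  console.error(
    `bids-validator requires Node.js >= 10.10 (found ${process.versions.node}); ` +
      'fs.promises and withFileTypes are not available on older releases.',
  )
  process.exit(1)
}
```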
It also requires at least v10.10 to expose the withFileTypes optimization. I've updated the engine restriction in package.json so that npm/yarn will throw an error if you try to update to a version of the validator without being on a new enough node release.
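The engines field in package.json would look roughly like this (the exact version string is an assumption):

```json
{
  "engines": {
    "node": ">=10.10.0"
  }
}
```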
Looks like CircleCI might be using a cache with an old Node.js version in it. I don't have permissions to trigger a build without the cache.
This reworks utils.files.readDir to avoid the extra syscalls found in #807 and greatly improve performance for network filesystems.
The main improvement is using readdir(dir, { withFileTypes: true }) which allows us to skip the extra stat call for each file in the dataset since the fs.Dirent objects returned include the file type. On Linux + NFS this is a pretty big improvement since it results in one getdents64 call for each directory instead of a readdir call for each directory and a stat call for every file and directory.
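A minimal sketch of the pattern, assuming a hypothetical listFiles helper rather than the validator's actual implementation:

```js
const fs = require('fs').promises
const path = require('path')

// Recursively list files using Dirent objects, so no per-entry stat() is needed.
// Requires Node >= 10.10 for { withFileTypes: true }.
async function listFiles(dir) {
  // One readdir per directory returns names and file types together
  // (a single getdents64 on Linux instead of readdir + stat per entry).
  const entries = await fs.readdir(dir, { withFileTypes: true })
  const files = []
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name)
    if (entry.isDirectory()) {
      files.push(...(await listFiles(fullPath)))
    } else {
      files.push(fullPath)
    }
  }
  return files
}
```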
The second improvement is avoiding traversal of ignored files entirely. If a directory is ignored, we used to scan the entire thing anyway and just prune it at the end. This prevents a lot of extra scans of the .git tree for DataLad datasets.
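Pruning at traversal time rather than afterwards can be sketched as follows; the isIgnored predicate is a stand-in for the validator's real ignore logic (e.g. .bidsignore rules), and the function names are hypothetical:

```js
const fs = require('fs').promises
const path = require('path')

// Stand-in predicate; the real validator applies its own ignore rules.
const isIgnored = relativePath => relativePath.split(path.sep).includes('.git')

async function listFilesPruned(root, dir = root) {
  const entries = await fs.readdir(dir, { withFileTypes: true })
  const files = []
  for (const entry of entries) {
    const fullPath = path.join(dir, entry.name)
    const relativePath = path.relative(root, fullPath)
    // Skip ignored paths up front instead of scanning them and pruning at the end.
    if (isIgnored(relativePath)) continue
    if (entry.isDirectory()) {
      files.push(...(await listFilesPruned(root, fullPath)))
    } else {
      files.push(fullPath)
    }
  }
  return files
}
```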
Some benchmarks - test suite on my workstation (ext4 + SSD):
Validating a real dataset with 105k files on AWS (NFSv4 client + EFS):