-
Notifications
You must be signed in to change notification settings - Fork 30.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fs: partition readFile to avoid threadpool exhaustion #17054
Conversation
49d3297
to
16195e0
Compare
Working on linter errors. |
Thanks for following up, pinging @nodejs/fs for review. |
I can see how this is a concern in a theoretical sense but I don't remember any bug reports where it was an actual issue. Seems like premature (de)optimization. |
@bnoordhuis I wouldn't call this a de-optimization. It optimizes the throughput of the threadpool in its entirety by increasing the number of requests that a readFile makes. It's an optimization for throughput, at the cost of the latency of large readFiles. I think this is in the spirit of Node.js: handle many client requests simultaneously on a small number of threads, and don't do too much work in one shot on any of the threads. This is already the approach taken by a readStream. The benchmark/fs/readfile-clogging.js demonstrates this:
FWIW The latency cost can be largely mitigated with the use of a more reasonably-sized buffer. The 1-thread numbers for readfile.js improve to a 30% degradation and the 10-thread numbers are comparable to the non-partitioned performance. benchmark/fs/readfile.js Here's the (rounded) readFile throughout for various read lengths on the 16MB file: With an 8KB buffer:
With a 64KB buffer:
With the full 16MB file in one shot:
benchmark/fs/readfile-clogging.js Here's the (rounded) zip throughout for various read sizes: With an 8KB buffer:
With a 64KB buffer:
With the full 16MB file in one shot:
Conclusion Admittedly the zip operation I'm using in |
benchmark/fs/readfile-clogging.js
Outdated
console.log(`bench ended, reads ${reads} zips ${zips}`); | ||
bench_ended = true; | ||
bench.end(reads); | ||
bench.end(zips); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Calling this twice does not make sense and will break compare.js which is expecting one bench.end()
per benchmark.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mscdex Thanks for pointing this out. How ought I report throughput for two separate variables like this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't. Perhaps just combine both values for total fulfilled requests per second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. I'll leave in the console.log then so the distinction between request type is clear.
benchmark/fs/readfile-clogging.js
Outdated
bench.end(reads); | ||
bench.end(zips); | ||
try { fs.unlinkSync(filename); } catch (e) {} | ||
process.exit(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't really safe since process.send()
used by bench.end()
is not synchronous. It's better to just return early in afterRead()
and afterZip()
when bench_ended === true
and let the process exit naturally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this process.exit(0)
is just a copy/paste from benchmark/fs/readfile.js
. I'll fix this in both places.
benchmark/fs/readfile-clogging.js
Outdated
|
||
var reads = 0; | ||
var zips = 0; | ||
var bench_ended = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit, but lower camelCase is typically used in JS portions of node core and underscores are typically used in C++ portions.
I understand that. My point is that no one complained so far - people aren't filing bug reports. To me that suggests it's mostly a theoretical issue. Meanwhile, the proposed changes will almost certainly regress some workloads and people are bound to take notice of that. |
True. But other workloads should be accelerated -- it's the readfile.js vs. readfile-clogging.js tradeoff. |
I think I agree with Ben here. Anyone who wants a file read in chunks can just use |
maybe add a separate API instead? |
And if they want to make giant reads, they can use But if they've opted for the simplicity of |
On an unrelated note, please do not @ mention me in commit messages. |
6598119
to
b9971e4
Compare
Fixed, sorry. |
Hello @davisjam and thank you for the contribution 🎩 Sure looks like you did a lot of research and experimentation, and I really appreciate that.
There are several assumptions and rule-of-thumb optimizations around the uv threadpool [addition: based on empirical experience, and feedback]. One of those is that since the pool serves I/O bound operations, a small pool is enough. As such doing multiple interleaved FS operations is an anti-pattern. |
@refack Perhaps I misunderstand you, but this PR does not increase concurrency. With my patch, an I agree that a small pool is good for certain activities, but reading large files in one shot is not one of them. A small pool suffices so long as each task doesn't take too long, but a long-running task on a small pool monopolizes its assigned thread, degrading the task throughput of the pool. Then indeed (one thread in) "the uv threadpool is all consumed doing the same operation." [addition: The trouble is that on Linux, a thread still performs the I/O-bound task synchronously, since approaches like KAIO have been rejected (see here). So if the task is long-running, the thread blocks for a long time.] If the threadpool is used solely for "large" tasks, there's no problem -- each task takes a long time anyway, and partitioning them just adds overhead. But if the threadpool is used for a mix of larger and smaller tasks (e.g. serving different sized files to different clients, running compression and file I/O concurrently, etc.), then the larger tasks will harm the throughput of the smaller tasks. In my benchmark/fs/readfile-clogging.js benchmark, the small task throughput improves by 50x if you partition the large reads. |
I definitely appreciate the work here, but I think I'm also falling on the -1 side on this. I think a better approach would be to simply increase efforts to warn dev's away from using |
@jasnell Thanks for your input!
My 64KB benchmark suggests that scripts that use |
Right, but the current |
Found a few minutes for some deeper benchmarking...I've collected measurements across a range of partition sizes to give a better sense of the tradeoffs between degrading readFile performance and improving threadpool throughput. I looked at the following partition sizes in KB: 4 8 16 32 64 128 256 512 1024 4096 16384. At each stage I doubled the partition size until I reached 1024 KB (1MB), at which point I quadrupled to 4MB and again to 16MB. The final partition size, 16384KB (16MB), is the size of the file being read, so this last size is the baseline, equivalent to the current behavior of Node. The numbers I'm reporting represent a single run of the benchmarks on the machine described above, which is otherwise idle. Since it's just one run for each partition size, these numbers are just an estimate. Excerpting the "1 and 10 concurrent readFile's on a 16MB file" performance from benchmark/fs/readfile.js:
Excerpting the "10 concurrent readFile's on a 16MB file" performance from benchmark/fs/readfile-clogging.js:
Summarizing these results:
Recommendation: The 1-reader case seems pretty unrealistic, so let's focus on the 10-reader case. It looks to me like if we go with a 64KB partition, for pure readFile we face somewhere between a 10% drop in throughput (reported earlier) and a negligible drop in throughput (in this data). For this we get a 50x improvement in throughput for contending threadpool jobs. For better readFile performance, a larger blocksize could be used while still improving overall threadpool throughput. Since the patch is a one-liner, nothing fancy, this seems like a pretty good trade to me. As has been discussed, best practice is certainly not to use fs.readFile for serving files. But for users who are doing so, I think this patch could give them nice performance improvements for free. Docs: |
Actually, I just checked the crypto module. It does not chunk/partition large requests. The following example will not print "Short buf finished" until there are no more long requests in the threadpool queue and one of the workers picks up the short request. var nBytes = 10 * 1024*1024; /* 50 MB */
var nLongRequets = 20;
const crypto = require('crypto');
for (var i = 0; i < nLongRequets; i++) {
crypto.randomBytes(nBytes, (buf) => {
console.log('Long buf finished');
});
}
crypto.randomBytes(1, (buf) => {
console.log('Short buf finished');
});
console.log('begin'); Thoughts on a similar PR to partition large crypto requests like this, or a doc-change PR like #17154 with a warning? For the FS there are alternatives to |
@nodejs/crypto (see #17054 (comment)) |
Same as #17054 (comment) with the addendum that As well, there is hardly ever a reason to request more than a few hundred bytes of randomness at a time. I don't think large requests are a practical concern. |
@addaleax Yes. |
Landed in 67a4ce1 |
Problem: Node implements fs.readFile as: - a call to stat, then - a C++ -> libuv request to read the entire file using the stat size Why is this bad? The effect is to place on the libuv threadpool a potentially-large read request, occupying the libuv thread until it completes. While readFile certainly requires buffering the entire file contents, it can partition the read into smaller buffers (as is done on other read paths) along the way to avoid threadpool exhaustion. If the file is relatively large or stored on a slow medium, reading the entire file in one shot seems particularly harmful, and presents a possible DoS vector. Solution: Partition the read into multiple smaller requests. Considerations: 1. Correctness I don't think partitioning the read like this raises any additional risk of read-write races on the FS. If the application is concurrently readFile'ing and modifying the file, it will already see funny behavior. Though libuv uses preadv where available, this doesn't guarantee read atomicity in the presence of concurrent writes. 2. Performance Downside: Partitioning means that a single large readFile will require into many "out and back" requests to libuv, introducing overhead. Upside: In between each "out and back", other work pending on the threadpool can take a turn. In short, although partitioning will slow down a large request, it will lead to better throughput if the threadpool is handling more than one type of request. Fixes: nodejs#17047 PR-URL: nodejs#17054 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Tiancheng "Timothy" Gu <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Matteo Collina <[email protected]> Reviewed-By: Sakthipriyan Vairamani <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>
This broke our CI. I did not realize it right away and landed a couple other commits afterwards, otherwise I would have reverted this. A change landed a few hours before this one that changed the tmpDir behavior and broke the test from this PR. I am submitting a fix. |
PR-URL: #17610 Refs: #17054 (comment) Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Evan Lucas <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Jeremiah Senkpiel <[email protected]> Reviewed-By: Jon Moss <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>
PR-URL: #17610 Refs: #17054 (comment) Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Evan Lucas <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Jeremiah Senkpiel <[email protected]> Reviewed-By: Jon Moss <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>
PR-URL: #17610 Refs: #17054 (comment) Reviewed-By: Anna Henningsen <[email protected]> Reviewed-By: Evan Lucas <[email protected]> Reviewed-By: Colin Ihrig <[email protected]> Reviewed-By: Jeremiah Senkpiel <[email protected]> Reviewed-By: Jon Moss <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>
Problem: Node implements fs.readFile as: - a call to stat, then - a C++ -> libuv request to read the entire file using the stat size Why is this bad? The effect is to place on the libuv threadpool a potentially-large read request, occupying the libuv thread until it completes. While readFile certainly requires buffering the entire file contents, it can partition the read into smaller buffers (as is done on other read paths) along the way to avoid threadpool exhaustion. If the file is relatively large or stored on a slow medium, reading the entire file in one shot seems particularly harmful, and presents a possible DoS vector. Solution: Partition the read into multiple smaller requests. Considerations: 1. Correctness I don't think partitioning the read like this raises any additional risk of read-write races on the FS. If the application is concurrently readFile'ing and modifying the file, it will already see funny behavior. Though libuv uses preadv where available, this doesn't guarantee read atomicity in the presence of concurrent writes. 2. Performance Downside: Partitioning means that a single large readFile will require into many "out and back" requests to libuv, introducing overhead. Upside: In between each "out and back", other work pending on the threadpool can take a turn. In short, although partitioning will slow down a large request, it will lead to better throughput if the threadpool is handling more than one type of request. Fixes: nodejs#17047 PR-URL: nodejs#17054 Reviewed-By: Benjamin Gruenbaum <[email protected]> Reviewed-By: Tiancheng "Timothy" Gu <[email protected]> Reviewed-By: Gireesh Punathil <[email protected]> Reviewed-By: James M Snell <[email protected]> Reviewed-By: Matteo Collina <[email protected]> Reviewed-By: Sakthipriyan Vairamani <[email protected]> Reviewed-By: Ruben Bridgewater <[email protected]>
The comments above noted that this doesn't appear to cause a significant performance regression, but we're seeing a 7.6-13.5x drop in read throughput between 8.x and 10.x in both the readfile benchmark and our real-world benchmarks that heavily exercise The readfile benchmark (Ubuntu 16):
From what I can extract from the comments in this PR, either no degradation or a 3.6-4.8x degradation was expected for the len=16M cases. As for why I think it's because of this change, the benchmark below compares // npm i async
const fs = require("fs");
const async = require("async");
function chunked(filename, cb) {
fs.readFile(filename, cb);
}
function oneshot(filename, cb) {
// shoddy implementation -- leaks fd in case of errors
fs.open(filename, "r", 0o666, (err, fd) => {
if (err) return cb(err);
fs.fstat(fd, (err, stats) => {
if (err) return cb(err);
const data = Buffer.allocUnsafe(stats.size);
fs.read(fd, data, 0, stats.size, 0, (err, bytesRead) => {
if (err) return cb(err);
fs.close(fd, err => {
cb(err, data);
});
});
});
});
}
fs.writeFileSync("./test.dat", Buffer.alloc(16e6, 'x'));
function bm(method, name, cb) {
const start = Date.now();
async.timesSeries(50, (n, next) => {
method("./test.dat", next);
}, err => {
if (err) return cb(err);
const diff = Date.now() - start;
console.log(name, diff);
cb();
});
}
async.series([
cb => bm(chunked, "fs.readFile()", cb),
cb => bm(oneshot, "oneshot", cb)
])
We've switched to Is anyone else able to verify that this degradation exists and/or was expected? |
I’ve done some empirical tests, and I saw some degradation. I’d recommend you to open up a new issue based on your data as it is likely to get more attention. |
And tag me in it! :-) |
I used "Reference in new issue" to create #25740 from @zbjornson's comment. |
Problem
Node implements fs.readFile as a call to stat, followed by a C++ -> libuv request
to read the entire file based on the size reported by stat.
Why is this bad?
The effect is to place on the libuv threadpool a potentially-large read request,
occupying the libuv thread until it completes.
While readFile certainly requires buffering the entire file contents,
it can partition the read into smaller buffers (as is done on other read paths)
along the way to avoid threadpool squatting.
If the file is relatively large or stored on a slow medium,
reading the entire file in one shot seems particularly harmful,
and presents a possible DoS vector.
Downsides to partitioning?
Correctness: I don't think partitioning the read like this raises any additional risk of read-write races on the FS. If the application is concurrently readFile'ing and modifying the file, it will already see funny behavior. Though libuv uses preadv where available, this doesn't guarantee read atomicity in the presence of concurrent writes.
Performance implications:
a. Downside: Partitioning means that a single large readFile will be broken into many "out and back" requests to libuv, introducing overhead.
b. Upside: In between each "out and back", other work pending on the threadpool can take a turn. In short, although partitioning will slow down a large request, it will lead to better throughput if the threadpool is handling more than one type of request.
Related
It might be that writeFile has similar behavior. The writeFile path is a bit more complex and I didn't investigate carefully.
Fix approach
Simple -- instead of reading in one shot, partition the read length using kReadFileBufferLength.
Test
I introduced a new test to ensure that fs.readFile works for files smaller and larger than kReadFileBufferLength. It works.
Performance:
Machine details:
$ uname -a
Linux jamie-Lenovo-K450e 4.8.0-56-generic My contribution to this logo throw in. #61~16.04.1-Ubuntu SMP Wed Jun 14 11:58:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Excerpts from lscpu:
Architecture: x86_64
CPU(s): 8
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
Model name: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
CPU MHz: 1499.194
benchmark/fs/readfile.js
Summary
Benchmarks using benchmark/fs/readfile.js are unfavorable. I ran three iterations with my change and three with an unmodified version. Performance within a version was similar across the three iterations, so I report only the third iteration for each.
With partitioned read:
Without change:
As discussed above, the readfile.js benchmark doesn't tell the whole story. The contention of this PR is that the 16MB reads will clog the threadpool, disadvantaging other work contending for the threadpool. I've introduced a new benchmark to characterize this.
Benchmark summary: I copied readfile.js and added a small asynchronous zlib operation to compete for the threadpool. If a non-partitioned readFile is clogging the threadpool, there will be a relatively small number of zips.
Performance summary:
Partitioned:
Non-partitioned:
Issue:
This commit addresses #17047.
Checklist
make -j4 test
(UNIX), orvcbuild test
(Windows) passesAffected core subsystem(s)
fs