fs: `readFile` in one syscall to avoid context switching #41436
base: main
Conversation
This PR makes `readFile` as fast as `readFileSync`. In fact, there is no point in reading the file in small chunks, as the buffer is already allocated. The `AbortController` does not justify such a huge performance hit.
Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1085/
My guess is that it was done that way to be fairer to other operations in the thread pool. EDIT: I think that's exactly what `benchmark/fs/readfile-partitioned.js` is testing, and why it is seeing such a large regression.
There is no reason for this test to be slower - it just reduces the number of context switches, meaning threads tend to stay longer on the CPU - provided the CPU has enough cores to accommodate all libuv threads.
One of the most viewed unanswered Node.js questions: https://stackoverflow.com/questions/52648229/i-o-performance-in-node-js-worker-threads
> both are completely independent
They both use the thread pool, so the tasks are competing for available threads.
As I said, that particular benchmark configuration is showing that large(r) reads can block other tasks in the thread pool, which is why the regression shows up: fewer total tasks are being completed.
I think the behavior being changed in this PR should be opt-in instead of being forced.
Additionally, regarding your answer on that StackOverflow question: there is no dynamic thread creation happening. It's a thread pool, so threads are created once at startup and then are reused during the life of the process.
@mscdex I was referring to the question I answered - in which
@addaleax @mscdex Isn't this the perfect moment to discuss increasing the default threadpool size from 4? It is a value inherited from another age, when high-end CPUs rarely had more than 4 cores and 2 MB was a lot of memory.
How about (this goes far beyond the scope of this PR, it's just an idea) having half of the threads marked for CPU work and half marked for I/O work, and when scheduling async work one specifies the type of load? If there are enough threads to cover all cores for every type of work, this arrangement would guarantee maximum performance in all cases. It is a huge change, I know, but it would bring an improvement across the board.
I think 4 is still a reasonable default. Besides, users can already change the threadpool size by setting the
I don't think that's going to be doable since addons can use the same threadpool and you would need to rely on them to do the right thing. Also, what if you have a task that uses both CPU and I/O (e.g. possibly via a third-party library function), which bucket would you put it in? |
You can do it like this: you still have a single pool, some tasks carry a hint, and the only rule is that tasks carrying the same hint never occupy more than half of the pool.
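The rule described above could be sketched in userland roughly as follows (a hypothetical design, not an existing libuv or Node.js API; `HintedPool`, the hint names and the half-pool cap are all illustrative):

```javascript
// Toy scheduler: one pool, optional per-task hint ('cpu' or 'io').
// Invariant: tasks with the same hint never hold more than half the slots,
// so one category of work cannot starve the other.
class HintedPool {
  constructor(size) {
    this.size = size;
    this.running = { cpu: 0, io: 0, none: 0 };
    this.queue = [];
  }
  submit(task, hint = 'none') {
    this.queue.push({ task, hint });
    this.pump();
  }
  pump() {
    const busy = this.running.cpu + this.running.io + this.running.none;
    if (busy >= this.size) return;
    // Pick the first queued task whose hint is still under its cap.
    const i = this.queue.findIndex(({ hint }) =>
      hint === 'none' || this.running[hint] < this.size / 2);
    if (i === -1) return;
    const { task, hint } = this.queue.splice(i, 1)[0];
    this.running[hint]++;
    Promise.resolve().then(task).finally(() => {
      this.running[hint]--;
      this.pump(); // a slot freed up, try to start the next task
    });
  }
}
```

With a pool of 4, a burst of long 'io' tasks would occupy at most 2 slots, leaving the rest available for 'cpu' or unhinted work.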
Besides, I learned about
It's already documented in the man page. If you think the current documentation is insufficient, submit a PR to improve it.
Also, when the documentation is updated to include the new config option, it should include a warning about the effect it can have on the thread pool.
Reading any file without a length limit is very risky, so I don't think it should be the default behavior. If it's really necessary then use `readFileSync` - and I think that's why we need to have both `readFile` and `readFileSync`, right?
@mawaregetsuka Both methods will read the whole file, as long as its size is below a predefined limit, since it has to fit in memory.
This has been extensively discussed before: #25741
so maybe we can let
```diff
@@ -99,7 +97,7 @@ class ReadFileContext {
   } else {
     buffer = this.buffer;
     offset = this.pos;
-    length = MathMin(kReadFileBufferLength, this.size - this.pos);
+    length = this.size - this.pos;
```
Suggested change:

```diff
-    length = this.size - this.pos;
+    length = this.signal ? MathMin(kReadFileBufferLength, this.size - this.pos) : this.size - this.pos;
```
@ronag I am not sure I understand the benefit of this? `this.signal` won't exist when starting the operation?
In fact, after reading the previous discussion I agree that the current implementation has some very important advantages that must be preserved. One of them is that it is fair and not prone to starvation - because after this PR, if you launch 5 `readFile` calls with the current 4 threads, you won't start reading the fifth one until one of the first four has completed.
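The fairness point can be illustrated with a toy simulation (this models a FIFO work queue in chunk-time units; it is not Node's actual implementation, and the 8-chunk file size is an arbitrary choice):

```javascript
// Simulate 5 files read through a 4-thread FIFO pool, where each file is
// 8 chunks long. readLenChunks is how many chunks one pool task reads:
// 8 = whole-file reads (this PR), 1 = chunked reads (current behavior).
// Returns the tick at which the 5th file first gets a thread.
function firstStart(readLenChunks) {
  const totalChunks = 8;
  const files = 5, pool = 4;
  const queue = [];
  for (let f = 0; f < files; f++) queue.push({ f, left: totalChunks });
  const threads = Array(pool).fill(null); // { job, busyUntil } or null
  const startedAt = Array(files).fill(-1);
  for (let t = 0; ; t++) {
    for (let i = 0; i < pool; i++) {
      const th = threads[i];
      if (th && th.busyUntil === t) {      // this read finished at tick t
        th.job.left -= readLenChunks;
        if (th.job.left > 0) queue.push(th.job); // re-queue the rest (FIFO)
        threads[i] = null;
      }
      if (!threads[i] && queue.length) {   // idle thread picks up next job
        const job = queue.shift();
        if (startedAt[job.f] === -1) startedAt[job.f] = t;
        threads[i] = { job, busyUntil: t + Math.min(readLenChunks, job.left) };
      }
    }
    if (startedAt.every(s => s !== -1)) return startedAt[4];
  }
}

console.log(firstStart(8)); // → 8: 5th file waits a full file-read time
console.log(firstStart(1)); // → 1: 5th file starts after one chunk time
```

So whole-file reads trade latency fairness for throughput, which is exactly the tension in this thread.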
Maybe this is the best option, yes
@mscdex @ronag @addaleax @mawaregetsuka I added an in-depth analysis of the performance loss. I agree that all that can be done as a quick hack at this point is to make the chunk size configurable.
I wonder if a quick hack - one that won't make it into a production version before Node 18 - is worth it. Now I see that this problem is very real with Node's own I/O. However, pushing this change through requires a concentrated effort and coordination with the libuv team. I would really like to draw your attention to this; it is very significant.
@mmomtchev I am glad to be working on solving this problem, but I am less and less sure what your goal is.
@mawaregetsuka If the underlying issue can be solved, it is better not to push this PR through - if one adds an additional parameter to
This I can attest is not true. We've been using Node at Dow Jones since 2011, and there are multiple services where we've benchmarked and tuned the threadpool size for the type of work and servers involved.
Refs: #41435