Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fs: readFile in one syscall to avoid context switching #41436

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

mmomtchev
Copy link
Contributor

This PR makes readFile as fast as readFileSync
In fact, there is no point in reading the file
in small chunks as the buffer is already allocated
The AbortController does not justify such a
huge performance hit
Refs: #41435

@nodejs-github-bot nodejs-github-bot added fs Issues and PRs related to the fs subsystem / file system. needs-ci PRs that need a full CI run. labels Jan 7, 2022
This PR makes `readFile` as fast as `readFileSync`
In fact, there is no point in reading the file
in small chunks as the buffer is already allocated
The `AbortController` does not justify such a
huge performance hit
@aduh95
Copy link
Contributor

aduh95 commented Jan 7, 2022

Benchmark CI: https://ci.nodejs.org/view/Node.js%20benchmark/job/benchmark-node-micro-benchmarks/1085/

                                                              confidence improvement accuracy (*)    (**)   (***)
fs/readfile.js concurrent=10 len=1024 duration=5                             -2.46 %       ±7.39%  ±9.83% ±12.81%
fs/readfile.js concurrent=10 len=16777216 duration=5                 ***     19.21 %       ±3.00%  ±4.02%  ±5.29%
fs/readfile.js concurrent=1 len=1024 duration=5                               0.58 %       ±3.15%  ±4.20%  ±5.47%
fs/readfile.js concurrent=1 len=16777216 duration=5                          16.31 %      ±17.07% ±22.74% ±29.66%
fs/readfile-partitioned.js concurrent=10 len=1024 dur=5                       0.46 %       ±6.34%  ±8.44% ±10.98%
fs/readfile-partitioned.js concurrent=10 len=16777216 dur=5          ***    -44.38 %       ±2.33%  ±3.13%  ±4.13%
fs/readfile-partitioned.js concurrent=1 len=1024 dur=5                       -0.69 %       ±9.08% ±12.09% ±15.74%
fs/readfile-partitioned.js concurrent=1 len=16777216 dur=5             *     -4.70 %       ±4.13%  ±5.49%  ±7.15%
fs/readfile-promises.js concurrent=10 len=1024 duration=5                    -0.71 %      ±11.85% ±15.77% ±20.53%
fs/readfile-promises.js concurrent=10 len=16777216 duration=5                 3.21 %       ±3.59%  ±4.78%  ±6.23%
fs/readfile-promises.js concurrent=1 len=1024 duration=5                      0.34 %       ±3.66%  ±4.87%  ±6.34%
fs/readfile-promises.js concurrent=1 len=16777216 duration=5                  6.85 %       ±9.65% ±12.84% ±16.71%

@mscdex
Copy link
Contributor

mscdex commented Jan 8, 2022

My guess is that it was done that way to be more fair to other operations in the thread pool.

EDIT: I think that's exactly what benchmark/fs/readfile-partitioned.js is testing and why it is seeing such a large regression.

@mmomtchev
Copy link
Contributor Author

EDIT: I think that's exactly what benchmark/fs/readfile-partitioned.js is testing and why it is seeing such a large regression.

benchmark/fs/readfile-partitioned.js does both reading and compression (zlib deflate) in parallel - and both are completely independent - and reports the sum of the number of reads and compressed blocks - so if you shift the balance towards reads you get a skewed results because they are not 1:1

There is no reason for this test to be slower - it just reduces the number of context switches meaning threads tend to stay longer on the CPU - if the CPU has enough cores to accommodate all libuv threads.

@mmomtchev
Copy link
Contributor Author

mmomtchev commented Jan 8, 2022

One of the most viewed Node.js questions without an answer

https://stackoverflow.com/questions/52648229/i-o-performance-in-node-js-worker-threads

Copy link
Contributor

@mscdex mscdex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both are completely independent

They both use the thread pool, so the tasks are competing for available threads.

As I said, that particular benchmark configuration is showing that large(r) reads can block other tasks in the thread pool, which is why the regression shows up because fewer total tasks are being completed.

I think the behavior being changed in this PR should be opt-in instead of being forced.

Additionally, regarding your answer on that StackOverflow question: there is no dynamic thread creation happening. It's a thread pool, so threads are created once at startup and then are reused during the life of the process.

@mmomtchev
Copy link
Contributor Author

@mscdex I was referring to the question I was answered - in which readFileSync is faster than readFile including the creation of the worker thread (V8 isolate) - not to the thread pool

@mmomtchev
Copy link
Contributor Author

mmomtchev commented Jan 8, 2022

@addaleax @mscdex Isn't this the perfect moment to discuss increasing the default thread size from 4? It is a value inherited from another age when high-end CPUs rarely had more than 4 cores and 2MB was a lot of memory.
This is a general problem of one thread doing heavy I/O and another thread doing only CPU work (compression in this case) - normally these two threads could perfectly co-exist on a single core - but here we are hitting the thread pool limit.
When one is reading in small chunks, leaving the compression more CPU time, it is the OS readahead that comes into play - this is why this test is so heavily impacted. In this case, there is an additional kernel thread that does work.

@mmomtchev mmomtchev changed the title readFile in one syscall to avoid context switching fs: readFile in one syscall to avoid context switching Jan 8, 2022
@mmomtchev
Copy link
Contributor Author

How about (it goes far beyond the scope of this PR, it just an idea) having half of the threads marked for CPU work and half of the threads marked for IO work, and when scheduling async work one is to specify the type of load - if there are enough threads to cover all cores for every type of work this arrangement would guarantee maximum performance in all cases. It is a huge change, I know, but it would bring an improvement all across the board.

@mscdex
Copy link
Contributor

mscdex commented Jan 8, 2022

Isn't this the perfect moment to discuss increasing the default thread size from 4?

I think 4 is still a reasonable default. Besides, users can already change the threadpool size by setting the UV_THREADPOOL_SIZE environment variable.

How about (it goes far beyond the scope of this PR, it just an idea) having half of the threads marked for CPU work and half of the threads marked for IO work

I don't think that's going to be doable since addons can use the same threadpool and you would need to rely on them to do the right thing. Also, what if you have a task that uses both CPU and I/O (e.g. possibly via a third-party library function), which bucket would you put it in?

@mmomtchev
Copy link
Contributor Author

I don't think that's going to be doable since addons can use the same threadpool and you would need to rely on them to do the right thing. Also, what if you have a task that uses both CPU and I/O (e.g. possibly via a third-party library function), which bucket would you put it in?

You can do it like this: you still have a single pool, some tasks carry a hint and the only rule is that you don't go over half of the pool for tasks that carry the same hint

@mmomtchev
Copy link
Contributor Author

Besides, I learned about UV_THREADPOOL_SIZE when I started working on gdal-async - I am quite confident that this option remains virtually unknown outside of the core team. It is a very tricky option and increasing it beyond a certain value is counter-productive as it has a very negative effect on the CPU-locality - I have at least one benchmark in gdal-async that is fastest with UV_THREADPOOL_SIZE=1. But this should probably be discussed elsewhere.
@addaleax, @mscdex Do you think that reading the whole file in one operation should be opt-in or opt-out?

@mscdex
Copy link
Contributor

mscdex commented Jan 8, 2022

I am quite confident that this option remains virtually unknown outside of the core team

It's already documented in the man page, the --help output, and in the API documentation. Additionally it's referenced in a handful of API documentation sections, including the documentation for fs, crypto, and dns.

If you think the current documentation is insufficient, submit a PR to improve it.

Do you think that reading the whole file in one operation should be opt-in or opt-out?

I think the behavior being changed in this PR should be opt-in instead of being forced.

Also, when the documentation is updated to include the new config option, it should include a warning about the effect it can have on the thread pool.

Copy link
Contributor

@mawaregetsuka mawaregetsuka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading any file without a length limit is very risky so I don't think it should be the default behavior. If it's really necessary then use readFileSync and I think that's why we need to have both readFile and readFileSync, right?

@mmomtchev
Copy link
Contributor Author

@mawaregetsuka both methods will read the whole file for as long as its size is inferior to a predefined limit as it has to fit in memory

@mmomtchev
Copy link
Contributor Author

This has been extensively discussed before: #25741

@mawaregetsuka
Copy link
Contributor

so maybe we can let kReadFileBufferLength become an parameter with default value instead of const?

@@ -99,7 +97,7 @@ class ReadFileContext {
} else {
buffer = this.buffer;
offset = this.pos;
length = MathMin(kReadFileBufferLength, this.size - this.pos);
length = this.size - this.pos;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
length = this.size - this.pos;
length = this.signal ? MathMin(kReadFileBufferLength, this.size - this.pos) : this.size - this.pos;

Copy link
Contributor Author

@mmomtchev mmomtchev Jan 12, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ronag I am not sure I understand the benefit of this? this.signal won't exist when starting the operation?
In fact, after reading the previous discussion I agree that the current implementation has some very important advantages that must be preserved.
One of them is that it is fair and not prone to starvation - because after this PR if you launch 5 readFile with the current 4 threads you won't start reading the fifth one until the first four are completed.

@mmomtchev
Copy link
Contributor Author

so maybe we can let kReadFileBufferLength become an parameter with default value instead of const?

Maybe this is the best option, yes

@mmomtchev
Copy link
Contributor Author

@mscdex @ronag @addaleax @mawaregetsuka I added an in-depth analysis of the performance loss.

I also discovered that CPU/FAST_IO/SLOW_IO hints are already supported by libuv but Node does not take advantage of this - all of its async work is scheduled as CPU work.

I agree that everything that can be done as a quick hack at this point is to render the chunk size configurable.

@mmomtchev
Copy link
Contributor Author

I wonder if a quick hack, and one that won't make it in a production version before Node 18, is worth it.
My opinion is that the problem from the issue (not being able to queue work from the worker threads) is a very real performance and scalability problem - I am the author of two Node.js addons which use extensively asynchronous work and I have encountered it in both of them and this is the very reason I went for my own threads in ExprtTk.js, completely skipping uv_queue_work and relying only on threadsafe_function to call JS. This is what I was planning to do with gdal-async too.

Now I see that this problem is very real with Node's own I/O. However pushing this change requires a concentrated effort and coordination with the libuv team. I would really like to draw your attention to this, it is very significant.

@mawaregetsuka
Copy link
Contributor

@mmomtchev I am glad to be working to solve this problem but I am getting less and less aware of what your goal is. function fs.readfile performance? Or something deeper?

@mmomtchev
Copy link
Contributor Author

@mawaregetsuka If the underlying issue can be solved, better not push this PR through - if one adds an additional parameter to readFile it will have to stay in later versions
But given the underlying issue, I am afraid that it won't happen overnight

@adamrights
Copy link

Besides, I learned about UV_THREADPOOL_SIZE when I started working on gdal-async - I am quite confident that this option remains virtually unknown outside of the core team.

This I can attest is not true. We've been using node at Dowjones since 2011, and there are multiple services where we've benchmarked and tuned the threadpool size for the type of work/servers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
fs Issues and PRs related to the fs subsystem / file system. needs-ci PRs that need a full CI run.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants