Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the bug caused by fast api changes in v22.5.0 and have a post-mortem #53934

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

anonrig
Copy link
Member

@anonrig anonrig commented Jul 18, 2024

For those who are unfamiliar, a recent change of adding V8 Fast API to fs.closeSync broke almost all applications in v22.5.0. The goal of that PR was to significantly improve the performance of this function and speed up applications like NPM. Unfortunately, I broke it. For reference, here is the root cause of this incident: #53627

Problem

Here's the fix for the bug breaking v22.5.0. Before diving into the technical implementation, let me explain what the bugs were.

There were actually 2 bugs causing this issue.

The crash

The first issue was the crash with the error message: ERROR: v8::Object::GetCreationContextChecked No creation context available. This error was caused by lib/internal/fs/read/context.js. We were destructuring the result of internalBinding('fs') which caused the creation context of the function being unavailable.

Previously, it was:

const { close } = internalBinding('fs')
close(fd, req)

The fix was:

const binding = internalBinding('fs')
binding.close(fd, req)

Breaking NPM

The second issue was caused by the addition of V8 Fast API call for fs.closeSync(), at least that's what I thought I was adding the fast API for. Unfortunately, adding a fast API to Close function, even though the fast API had a function signature of FastClose(recv, uin32_t fd, options), it was still getting triggered for fs.close(fd, req).

Unfortunately, this has been the expectation from my side, and unfortunately, under optimization, causing V8 Fast API to trigger it and causing NPM to fail.

The error message was:

npm error Exit handler never called!
npm error This is an error with npm itself. Please report this error at:

Since FastClose didn't receive any function as a parameter, we weren't executing it. This resulted in fs.close(fd, cb) to be executed in the fast API context, and causing NPM to fail, since it was expecting the result of the close operation.

Moving Sync and non-sync version to 2 different C++ functions and adding fast API to sync version fixes this problem.

CITGM

Node.js has a project called Canary in the gold mine (CITGM) to catch unintended bugs that might affect the ecosystem, like this. Unfortunately, recently we discovered that even if there were failures our CI jobs were successful. This issue is currently investigated under nodejs/citgm#1067.

Solution

We need better testing for V8 Fast API implementations. There are some unintended consequences of these additions and unfortunately, it requires a more thorough pull-request review.

Thank you!

@wesleytodd
Copy link
Member

Related: nodejs/citgm#1067

@avivkeller avivkeller added fs Issues and PRs related to the fs subsystem / file system. v8 engine Issues and PRs related to the V8 dependency. v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. labels Jul 19, 2024
@avivkeller
Copy link
Member

avivkeller commented Jul 19, 2024

Additionally, if you are facing an issue with v22.5.0 as a result of the issues described, this is not the place to mention them.

Please refer to #53902 for tracking the fs crashing.

Please refer to npm/cli#7657 for tracking the npm errors.

Both of these issues were resolved via a revert landed in #53904. Please update to v22.5.1.

If you are experiencing an issue unrelated to the above, please visit nodejs/help.

@avivkeller avivkeller added the post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. label Jul 19, 2024
@nodejs-github-bot nodejs-github-bot added c++ Issues and PRs that require attention from people who are familiar with C++. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels Jul 19, 2024
@anonrig anonrig force-pushed the add-fast-close branch 2 times, most recently from a85841e to 0c93582 Compare July 19, 2024 02:09
const uint32_t fd,
// NOLINTNEXTLINE(runtime/references) This is V8 api.
v8::FastApiCallbackOptions& options) {
Environment* env = Environment::GetCurrent(recv->GetCreationContextChecked());
Copy link
Member

@devsnek devsnek Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not 100% sure but i think you may need to declare a HandleScope using options.isolate prior to opening this context handle, otherwise its leaking into whatever ambient HandleScope is present up the stack. (also, Env is reused across all contexts in an Isolate right? if that's the case, maybe we can just store an env pointer on the isolate instead of reaching through the context? if i'm misremembering and this doesn't work please ignore me)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't use the latest V8 version to leverage options.isolate. Node 22 uses 12.4.254.21-node.15. Without options.isolate change landed on v8, I don't think there's much we can do.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Env is reused across all contexts in an Isolate right? if that's the case, maybe we can just store an env pointer on the isolate instead of reaching through the context?

That is not strictly maintained, though we try to keep it that way (e.g. see discussions in #47855).

I think we discussed using Isolate::SetData() somewhere else, though the issue is that there are only 4 slots, and it may be a problem for Node.js embedders that also are already using those slots.

Copy link
Member

@joyeecheung joyeecheung Jul 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re. handle scopes, you can already get the isolate from the context and create a handle scope from that. I am not sure if that's really an issue, though - we almost never create handle scopes in our callbacks, it seems fine to assume that fast APIs work in the same way? Maybe it's still good to do for the sake of code hygine..

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we discussed using Isolate::SetData() somewhere else, though the issue is that there are only 4 slots, and it may be a problem for Node.js embedders that also are already using those slots.

We used to store IsolateData* there in the past: c3cd453

That's not quite Environment* (which, yeah, technically doesn't always need to be the same for a single Isolate* – if we ever wanted to productionize something like synchronous-worker, we could).

I agree that that's probably not ideal for embedders, and we should avoid re-entangling these bits if we can. If we need to pass something like an Environment* through, would something like a v8::External* or BaseObject* work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I am hitting the handle scope issue in #54384 so yes it is required - I used a fix with v8::Isolate::TryGetCurrent there since v22 is stuck with 12.4, on main which has 12.8 this can use the isolate from fallback options to create the handle scope.

Copy link
Member

@mcollina mcollina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@Abhijrathod

This comment has been minimized.

@nodejs-github-bot
Copy link
Collaborator

sir-gon pushed a commit to sir-gon/algorithm-exercises-js that referenced this pull request Jul 19, 2024
sir-gon pushed a commit to sir-gon/algorithm-exercises-ts that referenced this pull request Jul 19, 2024
@nodejs-github-bot

This comment was marked as outdated.

const common = require('../common');
const fs = require('fs');

// A large number is selected to ensure that the fast path is called.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ensured that 100_000 is enough?
Any chance to really verify that fast API is actually used?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can try something similar to https://chromium.googlesource.com/v8/v8/+/e3e8ea5d657b886201d4e1982589a72f77909ba4/test/mjsunit/compiler/fast-api-helpers.js#7 though it needs to be paired with specific V8 optimizer flags and can be subject to changes as V8 changes their optimizing strategy.

Copy link
Member

@devsnek devsnek Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mjsunit.js exists in the node tree, you should be able to load it into this test with --allow-natives-syntax and then you can use stuff like assertOptimized and it will continue to work as v8 is updated. You could also add counters or use the v8 cpu profiler to verify that the fast function was called.

FS_SYNC_TRACE_END(close);

if (is_uv_error(err)) {
options.fallback = true;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means retrying, correct? In that case, quoting https://man7.org/linux/man-pages/man2/close.2.html: (emphasis mine)

Dealing with error returns from close()
A careful programmer will check the return value of close(), since it is quite possible that errors on a previous write(2) operation are reported only on the final close() that releases the open file description. Failing to check the return value when closing a file may lead to silent loss of data. This can especially be observed with NFS and with disk quota.

Note, however, that a failure return should be used only for diagnostic purposes (i.e., a warning to the application that there may still be I/O pending or there may have been failed I/O) or remedial purposes (e.g., writing the file once more or creating a backup).

Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.

Many other implementations similarly always close the file descriptor (except in the case of EBADF, meaning that the file descriptor was invalid) even if they subsequently report an error on return from close(). POSIX.1 is currently silent on this point, but there are plans to mandate this behavior in the next major release of the standard.

A careful programmer who wants to know about I/O errors may precede close() with a call to fsync(2).

The EINTR error is a somewhat special case. Regarding the EINTR error, POSIX.1-2008 says:

If close() is interrupted by a signal that is to be caught, it shall return -1 with errno set to EINTR and the state of fildes is unspecified.

This permits the behavior that occurs on Linux and many other implementations, where, as with other errors that may be reported by close(), the file descriptor is guaranteed to be closed. However, it also permits another possibility: that the implementation returns an EINTR error and keeps the file descriptor open. (According to its documentation, HP-UX's close() does this.) The caller must then once more use close() to close the file descriptor, to avoid file descriptor leaks. This divergence in implementation behaviors provides a difficult hurdle for portable applications, since on many implementations, close() must not be called again after an EINTR error, and on at least one, close() must be called again. There are plans to address this conundrum for the next major release of the POSIX.1 standard.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, options.fallback = true will retry on the slow path.

Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.

Does this still apply since this is a IO blocking operation? If it does apply, does this make this pull-request invalid?

A careful programmer who wants to know about I/O errors may precede close() with a call to fsync(2).

This might be a breaking change isn't it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, options.fallback = true will retry on the slow path.

Ok – we can't really do that.

Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.

Does this still apply since this is a IO blocking operation? If it does apply, does this make this pull-request invalid?

Not really – it just means we can't use a fallback path here.

A careful programmer who wants to know about I/O errors may precede close() with a call to fsync(2).

This might be a breaking change isn't it?

It's not a suggestion that I'd implement inside of our fs.close() calls. If a user wants this, they can call our own fsync() wrapper.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when isolate is available on FastApiCallbackOptions, you'll be able to raise an exception directly from the fast function. Maybe this should wait for that?

test/parallel/test-fs-close-fast-api.js Outdated Show resolved Hide resolved
src/node_file.cc Show resolved Hide resolved
@nodejs-github-bot

This comment has been minimized.

@anonrig anonrig added the blocked PRs that are blocked by other issues or PRs. label Jul 26, 2024
@nodejs-github-bot
Copy link
Collaborator

@RafaelGSS RafaelGSS removed the v22.x v22.x Issues that can be reproduced on v22.x or PRs targeting the v22.x-staging branch. label Jul 30, 2024
sgohlke added a commit to dreamit-de/workflows that referenced this pull request Aug 5, 2024
22.5.1 fixed the npm issues: nodejs/node#53934
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked PRs that are blocked by other issues or PRs. c++ Issues and PRs that require attention from people who are familiar with C++. fs Issues and PRs related to the fs subsystem / file system. lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. post-mortem Issues and PRs related to the post-mortem diagnostics of Node.js. v8 engine Issues and PRs related to the V8 dependency.
Projects
None yet
Development

Successfully merging this pull request may close these issues.