-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix the bug caused by fast api changes in v22.5.0 and have a post-mortem #53934
base: main
Are you sure you want to change the base?
Conversation
Related: nodejs/citgm#1067 |
Additionally, if you are facing an issue with v22.5.0 as a result of the issues described, this is not the place to mention them. Please refer to #53902 for tracking the Please refer to npm/cli#7657 for tracking the Both of these issues were resolved via a revert landed in #53904. Please update to v22.5.1. If you are experiencing an issue unrelated to the above, please visit nodejs/help. |
a85841e
to
0c93582
Compare
const uint32_t fd, | ||
// NOLINTNEXTLINE(runtime/references) This is V8 api. | ||
v8::FastApiCallbackOptions& options) { | ||
Environment* env = Environment::GetCurrent(recv->GetCreationContextChecked()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not 100% sure but i think you may need to declare a HandleScope using options.isolate
prior to opening this context handle, otherwise its leaking into whatever ambient HandleScope is present up the stack. (also, Env is reused across all contexts in an Isolate right? if that's the case, maybe we can just store an env pointer on the isolate instead of reaching through the context? if i'm misremembering and this doesn't work please ignore me)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use the latest V8 version to leverage options.isolate
. Node 22 uses 12.4.254.21-node.15
. Without options.isolate
change landed on v8, I don't think there's much we can do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Env is reused across all contexts in an Isolate right? if that's the case, maybe we can just store an env pointer on the isolate instead of reaching through the context?
That is not strictly maintained, though we try to keep it that way (e.g. see discussions in #47855).
I think we discussed using Isolate::SetData()
somewhere else, though the issue is that there are only 4 slots, and it may be a problem for Node.js embedders that also are already using those slots.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re. handle scopes, you can already get the isolate from the context and create a handle scope from that. I am not sure if that's really an issue, though - we almost never create handle scopes in our callbacks, it seems fine to assume that fast APIs work in the same way? Maybe it's still good to do for the sake of code hygine..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we discussed using
Isolate::SetData()
somewhere else, though the issue is that there are only 4 slots, and it may be a problem for Node.js embedders that also are already using those slots.
We used to store IsolateData*
there in the past: c3cd453
That's not quite Environment*
(which, yeah, technically doesn't always need to be the same for a single Isolate*
– if we ever wanted to productionize something like synchronous-worker, we could).
I agree that that's probably not ideal for embedders, and we should avoid re-entangling these bits if we can. If we need to pass something like an Environment*
through, would something like a v8::External*
or BaseObject*
work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I am hitting the handle scope issue in #54384 so yes it is required - I used a fix with v8::Isolate::TryGetCurrent there since v22 is stuck with 12.4, on main which has 12.8 this can use the isolate from fallback options to create the handle scope.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
This comment has been minimized.
This comment has been minimized.
This comment was marked as outdated.
This comment was marked as outdated.
const common = require('../common'); | ||
const fs = require('fs'); | ||
|
||
// A large number is selected to ensure that the fast path is called. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ensured that 100_000 is enough?
Any chance to really verify that fast API is actually used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can try something similar to https://chromium.googlesource.com/v8/v8/+/e3e8ea5d657b886201d4e1982589a72f77909ba4/test/mjsunit/compiler/fast-api-helpers.js#7 though it needs to be paired with specific V8 optimizer flags and can be subject to changes as V8 changes their optimizing strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mjsunit.js exists in the node tree, you should be able to load it into this test with --allow-natives-syntax
and then you can use stuff like assertOptimized
and it will continue to work as v8 is updated. You could also add counters or use the v8 cpu profiler to verify that the fast function was called.
FS_SYNC_TRACE_END(close); | ||
|
||
if (is_uv_error(err)) { | ||
options.fallback = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This means retrying, correct? In that case, quoting https://man7.org/linux/man-pages/man2/close.2.html: (emphasis mine)
Dealing with error returns from close()
A careful programmer will check the return value of close(), since it is quite possible that errors on a previous write(2) operation are reported only on the final close() that releases the open file description. Failing to check the return value when closing a file may lead to silent loss of data. This can especially be observed with NFS and with disk quota.Note, however, that a failure return should be used only for diagnostic purposes (i.e., a warning to the application that there may still be I/O pending or there may have been failed I/O) or remedial purposes (e.g., writing the file once more or creating a backup).
Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.
Many other implementations similarly always close the file descriptor (except in the case of EBADF, meaning that the file descriptor was invalid) even if they subsequently report an error on return from close(). POSIX.1 is currently silent on this point, but there are plans to mandate this behavior in the next major release of the standard.
A careful programmer who wants to know about I/O errors may precede close() with a call to fsync(2).
The EINTR error is a somewhat special case. Regarding the EINTR error, POSIX.1-2008 says:
If close() is interrupted by a signal that is to be caught, it shall return -1 with errno set to EINTR and the state of fildes is unspecified.
This permits the behavior that occurs on Linux and many other implementations, where, as with other errors that may be reported by close(), the file descriptor is guaranteed to be closed. However, it also permits another possibility: that the implementation returns an EINTR error and keeps the file descriptor open. (According to its documentation, HP-UX's close() does this.) The caller must then once more use close() to close the file descriptor, to avoid file descriptor leaks. This divergence in implementation behaviors provides a difficult hurdle for portable applications, since on many implementations, close() must not be called again after an EINTR error, and on at least one, close() must be called again. There are plans to address this conundrum for the next major release of the POSIX.1 standard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, options.fallback = true
will retry on the slow path.
Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.
Does this still apply since this is a IO blocking operation? If it does apply, does this make this pull-request invalid?
A careful programmer who wants to know about I/O errors may precede close() with a call to fsync(2).
This might be a breaking change isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes,
options.fallback = true
will retry on the slow path.
Ok – we can't really do that.
Retrying the close() after a failure return is the wrong thing to do, since this may cause a reused file descriptor from another thread to be closed. This can occur because the Linux kernel always releases the file descriptor early in the close operation, freeing it for reuse; the steps that may return an error, such as flushing data to the filesystem or device, occur only later in the close operation.
Does this still apply since this is a IO blocking operation? If it does apply, does this make this pull-request invalid?
Not really – it just means we can't use a fallback path here.
A careful programmer who wants to know about I/O errors may precede close() with a call to fsync(2).
This might be a breaking change isn't it?
It's not a suggestion that I'd implement inside of our fs.close()
calls. If a user wants this, they can call our own fsync()
wrapper.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when isolate
is available on FastApiCallbackOptions
, you'll be able to raise an exception directly from the fast function. Maybe this should wait for that?
This comment has been minimized.
This comment has been minimized.
22.5.1 fixed the npm issues: nodejs/node#53934
For those who are unfamiliar, a recent change of adding V8 Fast API to
fs.closeSync
broke almost all applications in v22.5.0. The goal of that PR was to significantly improve the performance of this function and speed up applications like NPM. Unfortunately, I broke it. For reference, here is the root cause of this incident: #53627Problem
Here's the fix for the bug breaking v22.5.0. Before diving into the technical implementation, let me explain what the bugs were.
There were actually 2 bugs causing this issue.
The crash
The first issue was the crash with the error message:
ERROR: v8::Object::GetCreationContextChecked No creation context available
. This error was caused bylib/internal/fs/read/context.js
. We were destructuring the result ofinternalBinding('fs')
which caused the creation context of the function being unavailable.Previously, it was:
The fix was:
Breaking NPM
The second issue was caused by the addition of V8 Fast API call for
fs.closeSync()
, at least that's what I thought I was adding the fast API for. Unfortunately, adding a fast API toClose
function, even though the fast API had a function signature ofFastClose(recv, uin32_t fd, options)
, it was still getting triggered forfs.close(fd, req)
.Unfortunately, this has been the expectation from my side, and unfortunately, under optimization, causing V8 Fast API to trigger it and causing NPM to fail.
The error message was:
Since FastClose didn't receive any function as a parameter, we weren't executing it. This resulted in
fs.close(fd, cb)
to be executed in the fast API context, and causing NPM to fail, since it was expecting the result of the close operation.Moving Sync and non-sync version to 2 different C++ functions and adding fast API to sync version fixes this problem.
CITGM
Node.js has a project called Canary in the gold mine (CITGM) to catch unintended bugs that might affect the ecosystem, like this. Unfortunately, recently we discovered that even if there were failures our CI jobs were successful. This issue is currently investigated under nodejs/citgm#1067.
Solution
We need better testing for V8 Fast API implementations. There are some unintended consequences of these additions and unfortunately, it requires a more thorough pull-request review.
Thank you!