Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inconsistent Error Handling for Graceful Server Shutdown during Keepalive Ping #2624

Closed
w1am opened this issue Nov 28, 2023 · 12 comments
Closed

Comments

@w1am
Copy link

w1am commented Nov 28, 2023

Problem Description

I am experiencing a discrepancy in the error messages returned when reading from a gRPC stream and gracefully shutting down the connection midstream. Specifically, the grpc-js client incorrectly interprets a cancelled keepalive ping as an "Unavailable" error instead of the expected "Cancelled" error during such a shutdown. This behavior is related to my PR #2622. While the client logs Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled, it eventually throws an "Unavailable" error, which seems inconsistent with the actual server state.

Reproduction Steps

  1. Clone my repository.
  2. Run yarn to install dependencies.
  3. Execute ./server.sh to start the server.
  4. In a separate terminal, run node client.js to start the client.
  5. Wait for the client to establish a connection and start reading from the stream.
  6. Gracefully stop the server.
  7. Observe the logs to see the error messages.

Relevant code snippet handling ping failures:

transport.ts:429

this.session!.ping((err: Error | null, duration: number, payload: Buffer) => {
  if (err) {
    console.log(err); // Outputs: Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled
    this.keepaliveTrace('Ping failed with error ' + err.message);
    this.handleDisconnect();
  }
  this.keepaliveTrace('Received ping response');
  this.clearKeepaliveTimeout();
  this.maybeStartKeepalivePingTimer();
});

The log displays Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled, indicating a correct cancellation due to server shutdown. However, the final error reported to the client is "Unavailable".

Full log:

Event #0: test-event
Event #1: test-event
Event #2: test-event
Event #3: test-event
Event #4: test-event
Event #5: test-event
Event #6: test-event
Event #7: test-event
Event #8: test-event
Event #9: test-event
Event #10: test-event
Event #11: test-event
Manually stop the Docker container now...
!goaway called!
Error [ERR_HTTP2_PING_CANCEL]: HTTP2 ping cancelled
    at new NodeError (internal/errors.js:322:7)
    at pingCallback (internal/http2/core.js:897:10)
    at processTicksAndRejections (internal/process/task_queues.js:83:21) {
  code: 'ERR_HTTP2_PING_CANCEL'
}
!close called!
Event #12: test-event
Event #13: test-event
Event #14: test-event
Event #15: test-event
Event #16: test-event
Event #17: test-event
Event #18: test-event
Event #19: test-event
Event #20: test-event
Event #21: test-event
Event #22: test-event
Event #23: test-event
Event #24: test-event
Event #25: test-event
Event #26: test-event
Event #27: test-event
Event #28: test-event
Event #29: test-event
Event #30: test-event
Event #31: test-event
Event #32: test-event
Event #33: test-event
Event #34: test-event
Event #35: test-event
Event #36: test-event
Event #37: test-event
Event #38: test-event
Event #39: test-event
Event #40: test-event
Event #41: test-event
Event #42: test-event
Event #43: test-event
Event #44: test-event
Error: UnavailableError: 14 UNAVAILABLE: Connection dropped

Environment

  • OS: Linux Fedora 39
  • Node version: 14.21.3
  • Node installation method: t/j n
  • Package: @grpc/[email protected]

Additional Context

N/A

@murgatroid99
Copy link
Member

This ping cancellation you are observing is a session-level event, and a description of how the ping specifically ended. It is not supposed to have any direct correspondence to how an individual request on that session ends. We interpret any ping error we see to mean that the server is no longer available, so the UNAVAILABLE error code is appropriate.

When you say that you "Gracefully stop the server", it's not clear what that means. In the context of gRPC, that would generally involve allowing existing requests to end on their own while not accepting new requests, but that is clearly not what is happening here.

If you want to have more control over the status code the client sees, you should control that explicitly on the server. Most gRPC server implementations have a function that cancels all open requests (generally while shutting down). In grpc-js, that is Server#forceShutdown.

@w1am
Copy link
Author

w1am commented Nov 28, 2023

@murgatroid99 Thanks for the explanation! I'm running into a strange situation where I get "Unavailable" when connecting to my server in CI and "Cancelled" in the 'Latest' server environment.

When you say that you "Gracefully stop the server", it's not clear what that means

By graceful shutdown, I mean stopping the docker-compose (my server) using Ctrl + C.

However, previous versions (<=1.9.1) of gRPC-Js before this modification in transport.ts was added, I was getting the same error message (Cancelled) across both environments (CI and Latest). I would tend to agree that 'Unavailable' should be the right message. However, I still cannot figure out why I get different error messages for different server environments after upgrading gRPC-js.

@murgatroid99
Copy link
Member

Well, what is different between your CI and Latest server environments?

@w1am
Copy link
Author

w1am commented Nov 28, 2023

@murgatroid99 They are the same server but running in different environment.

@murgatroid99
Copy link
Member

The exact error you are seeing is directly related to how you are running the server (in a docker container) and how you are stopping it (sending SIGINT to docker-compose). If you are not doing those same things in both environments, it is unsurprising that you would see different errors, and even if you are, identical handling is not guaranteed on the client.

The expectation in gRPC is that a server gracefully shuts down by closing the listening port, sending each open session a GOAWAY with code 0, and then waiting for existing requests to end on their own before ending the process. These actions are generally encapsulated in a single operation in the gRPC API, like Server#tryShutdown in grpc-js. A server can also shutdown more aggressively by cancelling all pending requests after sending the GOAWAY.

A client will make its best effort to handle any other event that causes a connection to become unusable in the middle of processing a request, but specific outcomes are not guaranteed.

@w1am
Copy link
Author

w1am commented Nov 29, 2023

In my case, receiving a GOAWAY with code 0 is unexpectedly triggering handleDisconnect. This seems contrary to the expected behavior, where handleDisconnect should only respond to abrupt disconnects without a GOAWAY or with non-zero GOAWAY codes.

@murgatroid99
Copy link
Member

In your log, handleDisconnect is triggered by the ping error, not by the GOAWAY. If the server sends a GOAWAY with code 0, indicating that it is shutting down gracefully, and then immediately stops responding while there are still pending requests, it is not actually shutting down gracefully, so the client is back to making its best effort to handle the error.

@w1am
Copy link
Author

w1am commented Dec 4, 2023

I ran the test in another scenario to better illustrate the problem:

  1. I read data from a stream every second.
  2. I closed the connection as soon as the client connected to the server (at t=1s)

In this test, I set the keepAliveInterval to 10s and the keepAliveTimeout to 10s.

In server version 23.10.0:
After shutting down the server at t=1s, the session was closed at t=6s.

At t=10s, since the session had already closed, it returned CANCELLED to the client because the keepalive ping could not be sent.

In server version 24.0.0:
After shutting down the server at t=1s, the session was closed at t=10s.

The client managed to send the keepalive ping to the server before the server closed, and it returned UNAVAILABLE (by the ping error handler, i.e handleDisconnect) to the client.

I realized that the difference between the two servers is that in version 24.0.0, they upgraded to .NET 7, but nothing else changed. The changelog for .NET 7 can be found here.

Conclusion

For both server, the server was shut down at the same time (t=1s) and different error message is being returned because In version 23.10.0, the session closed prematurely at t=6s, making it impossible for the client to send the keepalive ping, resulting in a CANCELLED error. In contrast, in version 24.0.0, with the .NET 7 upgrade, the session remained open until t=10s. This allowed the client enough time to send the keepalive ping before the server shutdown, leading to an UNAVAILABLE error instead.

So, it appears that the discrepancy might have been introduced by the .NET 7 upgrade after all, but nothing obvious is indicated in the changelog.

@murgatroid99
Copy link
Member

The Node client does not currently end a call with the CANCELLED status because the keepalive ping could not be sent. The PR you sent would have introduced that behavior, but I didn't merge that PR. Either the CANCELLED status is coming from a different code path, or you are using your branch with your proposed change.

@w1am
Copy link
Author

w1am commented Dec 5, 2023

It seems that in .NET 7, the default value for the graceful ShutdownTimeout has changed, which affects how much time the server allows for ongoing requests to complete gracefully before forcefully terminating them during a shutdown process but it wasn't listed in the changelog. I am closing this issue.

@w1am w1am closed this as completed Dec 5, 2023
@manish-sharma-resolver
Copy link

I was wondering if this is the same issue that i am facing. I want to use tryShutdown() to gracefully process inflight requests or pending request that came in before SIGTERM or SIGINT and still processing but when the client side receives, it gets an error saying "Connection dropped" code 14. If it was just the server itself that we are concerned about than clent would anyhow wont be able to process anyhing.

I was wondering if we could create a functionality where we would not kill the connection abrupty but give a chance to the server to keep the connection alive for a while after the process has been done.

Like how hapi does it https://hapi.dev/api/?v=21.3.3#-await-serverstopoptions where it provides some timeout where we can process it within that time frame, else it will froce shutdown. Then we give the client some power to process the requests as well.

like
`tryShutdown({ timeout: 10000 }, () => {

/// callback called after that 10000 ms pause after the pending request processed. This way we will only close the connection in 10seconds after the pending requests have been processed.
})`

@murgatroid99
Copy link
Member

@manish-sharma-resolver To be clear, the functionality you want is how the function is supposed to work. I don't recognize that particular error in this situation, so it might make sense to file a separate issue about it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants