
[Core] Stop iterating cancelled grpc request streams (#23865) #27951

Merged

Conversation

rueian
Contributor

@rueian rueian commented Aug 17, 2022

Why are these changes needed?

This fixes the below grpc error mentioned in #23865.

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating requests!"
        debug_error_string = "None"
>

This error happens when proxying the grpc stream to the individual SpecificServer but the incoming grpc stream is already canceled. (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/_server.py#L353-L354)

Therefore, this change wraps the request_iterator in a new RequestIteratorProxy that catches the _CANCELLED signal.
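For reference, a minimal sketch of such a wrapper could look like the following (the class name comes from this PR's description; the exact exception handling and logging evolved during the review below, so treat this as illustrative rather than the merged code):

import grpc
import logging

logger = logging.getLogger(__name__)

class RequestIteratorProxy:
    def __init__(self, request_iterator):
        self.request_iterator = request_iterator

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.request_iterator)
        except grpc.RpcError:
            # The incoming stream was cancelled, so translate the error
            # into StopIteration and end the proxy loop gracefully.
            logger.warning("Stop iterating request stream because it is cancelled")
            raise StopIteration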

Related issue number

#23865

Checks

  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Contributor

@architkulkarni architkulkarni left a comment


Thanks for the PR, the fix looks reasonable--could you please add a test for this behavior?

@architkulkarni architkulkarni added the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Aug 17, 2022
@rueian
Contributor Author

rueian commented Aug 17, 2022

Sure, I will add it tomorrow.

@rueian rueian force-pushed the stop-proxy-cancelled-grpc-stream branch from d0016f1 to 24f6c20 on August 20, 2022 05:50
@rueian
Contributor Author

rueian commented Aug 20, 2022

Hi @architkulkarni,

Sorry for the delay; I have updated the test case.

@rueian rueian requested a review from architkulkarni August 22, 2022 10:40
Contributor

@ijrsvt ijrsvt left a comment


Sorry for the delay, but this LGTM!

@ijrsvt ijrsvt removed the @external-author-action-required Alternate tag for PRs where the author doesn't have labeling permission. label Aug 31, 2022
@ckw017
Member

ckw017 commented Sep 1, 2022

@rueian I tried running through your script in #23865 (go proxy on localhost:8888) and wasn't able to reproduce the issue, so I might be missing some context. Does this change fix your original problem with the proxy or does it just suppress the error message? Expected behavior would be that script exits gracefully after printing out [10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000]

try:
    return next(self.request_iterator)
except grpc.RpcError:
    # stop iteration if the stream is CANCELLED
Member

@ckw017 ckw017 Sep 1, 2022


Can you add logger.exception("Iterator failed with the following exception:") here? That way the cause of the error is logged somewhere instead of being silently discarded.

Contributor Author


Hi @ckw017, I used logger.warning instead of logger.exception, since logger.exception is not very helpful in the case of grpc.RpcError. Please see the screenshot below.

@rueian
Contributor Author

rueian commented Sep 1, 2022

Expected behavior would be that script exits gracefully after printing out [10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000]

Hi @ckw017,

Without this change, the script will still print out the expected result, but not quite gracefully. There will also be a "Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in ray_client_server_[port].out" message printed:
image

and there will be a "Proxying failed" stack trace in the log file:
image

This change actually translates grpc.RpcError into StopIteration to stop proxying the request stream gracefully.

I agree that we shouldn't silently discard the grpc.RpcError, but if we use logger.exception(), it will not be very helpful and will look like this:
image

and since the grpc.RpcError itself is empty (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/__init__.py#L302-L303), I think it is ok that we just print a warning: logger.warning("Stop iterating request stream because it is cancelled").
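(For context, the linked lines show the base class carries no payload at all; from grpc/__init__.py in v1.43.0:)

class RpcError(Exception):
    """Raised by the gRPC library to indicate non-OK-status RPC termination."""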

@rueian rueian force-pushed the stop-proxy-cancelled-grpc-stream branch from 24f6c20 to c2d21ce on September 1, 2022 14:05
@ckw017
Member

ckw017 commented Sep 1, 2022

and since the grpc.RpcError itself is empty (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/__init__.py#L302-L303), I think it is ok that we just print a warning: logger.warning("Stop iterating request stream because it is cancelled").

I agree that grpc.RpcError isn't that useful on its own, but the iterator will also catch anything that subclasses grpc.RpcError, which might have useful information. For example, it could catch this or this.

Out of curiosity, does this get rid of the "Log channel is reconnecting." message as well? If it does, can you try running your script with debug-level logging on so that we can get a better idea of what's actually happening? You can do that with something like:

import logging
logging.basicConfig(level=logging.DEBUG)

import ray
ray.init(..., logging_level=logging.DEBUG)

@rueian rueian force-pushed the stop-proxy-cancelled-grpc-stream branch 2 times, most recently from 19df9d3 to e0e16b9 on September 2, 2022 13:20
@rueian
Contributor Author

rueian commented Sep 2, 2022

import logging
logging.basicConfig(level=logging.DEBUG)

import ray
ray.init(..., logging_level=logging.DEBUG)

Hi @ckw017, I have added it, but unfortunately the log was not really helpful:

DEBUG:ray:[ray] Forcing OMP_NUM_THREADS=1 to avoid performance degradation with many workers (issue #6998). You can override this by explicitly setting OMP_NUM_THREADS.
2022-09-02 21:24:37,115	INFO client_builder.py:247 -- Passing the following kwargs to ray.init() on the server: logging_level
2022-09-02 21:24:37,117	DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2022-09-02 21:24:37,322	DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2022-09-02 21:24:37,323	DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2022-09-02 21:24:38,334	DEBUG worker.py:796 -- Pinging server.
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-09-02 21:24:38,337	DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2022-09-02 21:24:38,338	DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2022-09-02 21:24:38,948	DEBUG worker.py:629 -- Retaining 00b119b77d8c0eba44c665e31bc59911cf7f9a0d0100000001000000
2022-09-02 21:24:38,948	DEBUG worker.py:553 -- Scheduling task get_dashboard_url 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x01\x00\x00\x00'
2022-09-02 21:24:38,950	DEBUG worker.py:629 -- Retaining a0972fac8691ffaaffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:39,292	DEBUG worker.py:625 -- Releasing a0972fac8691ffaaffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,327	DEBUG worker.py:629 -- Retaining 00b119b77d8c0eba44c665e31bc59911cf7f9a0d0100000002000000
2022-09-02 21:24:42,352	DEBUG worker.py:629 -- Retaining 00b119b77d8c0eba44c665e31bc59911cf7f9a0d0100000003000000
2022-09-02 21:24:42,352	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,352	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,352	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,352	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,353	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,353	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,353	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,354	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,354	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,354	DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,355	DEBUG worker.py:629 -- Retaining d9811d3c1d420ce1ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,356	DEBUG worker.py:629 -- Retaining bcacbd0d9fe6d0caffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,356	DEBUG worker.py:629 -- Retaining 0296b6a385a0ebabffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,356	DEBUG worker.py:629 -- Retaining ab5085433c874bffffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357	DEBUG worker.py:629 -- Retaining bf43bb834add5fbeffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357	DEBUG worker.py:629 -- Retaining d7465eca34a71d1bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357	DEBUG worker.py:629 -- Retaining 7f4d261ddb36ab9bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357	DEBUG worker.py:629 -- Retaining 8d9e2c5883d6d8f9ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,358	DEBUG worker.py:629 -- Retaining a3044293ddc3bb69ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,358	DEBUG worker.py:629 -- Retaining e1f20a4cf4920a9affffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing e1f20a4cf4920a9affffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing a3044293ddc3bb69ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing 8d9e2c5883d6d8f9ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing 7f4d261ddb36ab9bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing d7465eca34a71d1bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing bf43bb834add5fbeffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing ab5085433c874bffffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing 0296b6a385a0ebabffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing bcacbd0d9fe6d0caffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866	DEBUG worker.py:625 -- Releasing d9811d3c1d420ce1ffffffffffffffffffffffff0100000001000000
[10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000]
2022-09-02 21:24:43,247	DEBUG dataclient.py:277 -- Shutting down data channel.

I agree that grpc.RpcError isn't that useful on its own, but the iterator will also catch anything that subclasses grpc.RpcError, which might have useful information. For example, it could catch this or this.

You are right. I have added a new type-check and test case to make sure that the new change does not catch those sub-exceptions.
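As a rough illustration of that narrowing (the helper name here is hypothetical, and the exact check in the merged code may differ), only the bare base class is translated into StopIteration:

import grpc

def next_request(request_iterator):
    # Hypothetical helper: subclasses of grpc.RpcError (such as
    # _MultiThreadedRendezvous) carry status codes and details, so they
    # are re-raised instead of being swallowed.
    try:
        return next(request_iterator)
    except grpc.RpcError as e:
        if type(e) is grpc.RpcError:
            raise StopIteration
        raise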

@ckw017
Member

ckw017 commented Sep 2, 2022

The server-side logs are mostly there for debugging (users usually won't check there unless they're trying to find what broke). I'm still wondering why we would avoid logging the exception that occurred. Even if it's a generic RpcError, logging the exception will still contain stack trace information that can show which specific codepath broke.

@ckw017
Member

ckw017 commented Sep 2, 2022

Overall, I'm not sure that ignoring the error and gracefully terminating when the iterator broke because of a grpc error is a good idea; it's essentially just burying the issue. For example, if the "Log channel is reconnecting" message is suppressed because we suppress the RpcError, then users may have their application-level logs dropped silently without any warning.

@rueian rueian force-pushed the stop-proxy-cancelled-grpc-stream branch from e0e16b9 to 94d75ad on September 2, 2022 23:48
@rueian
Contributor Author

rueian commented Sep 3, 2022

Even if it's a generic RPCError, logging the exception will still contain stack trace information that can give info about what specific codepath broke.

Sure, I have added the stack trace.

So, in addition to translating the CANCELLED grpc.RpcError into StopIteration, the server's logs now become this:
image

instead of the original one, which traces into grpc/_channel.py and just leaves an "Exception iterating requests!" message:
image

I hope this change can save others some time when debugging a CANCELLED grpc request stream.
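As a rough sketch of the logging described above (the function name is hypothetical, and the exact message in the merged code may differ), the warning can carry the stack trace like this:

import logging
import traceback

logger = logging.getLogger(__name__)

def warn_cancelled_stream():
    # Must be called while the grpc.RpcError is being handled so that
    # traceback.format_exc() captures the relevant stack trace.
    logger.warning(
        "Stop iterating request stream because it is cancelled:\n%s",
        traceback.format_exc(),
    )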

For example, if the Log channel is reconnecting message is suppressed because we suppress the RPCError, then users may have their application level logs dropped silently without any warning.

I think that raising StopIteration in the request stream will not lead to dropping users' application-level logs, since the response stream is still active.

Member

@ckw017 ckw017 left a comment


This looks good now, thanks! If you can merge master and re-run, we can get this merged.

@rueian rueian force-pushed the stop-proxy-cancelled-grpc-stream branch 2 times, most recently from bedba55 to 38a2a4c on September 9, 2022 05:05
@rueian
Contributor Author

rueian commented Sep 10, 2022

Hi @ckw017,

Thank you for your help. I have rebased the commit and all non-pending tests have passed now.

@stale

stale bot commented Oct 12, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@stale stale bot added the stale The issue is stale. It will be closed within 7 days unless there are further conversation label Oct 12, 2022
@rueian rueian force-pushed the stop-proxy-cancelled-grpc-stream branch from 38a2a4c to 39cf221 on October 12, 2022 12:50
@rueian
Contributor Author

rueian commented Oct 12, 2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

  • If you'd like to keep this open, just leave any comment, and the stale label will be removed.

Not stale.

Hi @ckw017, sorry to bother you, could you review this again?

@stale stale bot removed stale The issue is stale. It will be closed within 7 days unless there are further conversation labels Oct 12, 2022
@zhe-thoughts
Collaborator

@ckw017 @architkulkarni @ijrsvt Could you help follow up here? Thanks for helping our new contributor so far!

@architkulkarni
Contributor

The failed tests look unrelated and were flaky on master in the past. Some failed with "failure to upload artifact" so I've restarted those to be safe. I'll merge this PR once they pass.

@architkulkarni
Contributor

  • Doc failures (no module starlette) unrelated and were broken on master
  • AIR example failures unrelated and were flaky on master
  • test_fastapi failure unrelated and was broken on master
  • Minimal install failure (typing_extensions) was flaky on master, unrelated
  • test_gpu was flaky on master
  • Workflows failure ("deprecated() got an unexpected keyword argument 'name'") unrelated

@architkulkarni architkulkarni added the tests-ok The tagger certifies test failures are unrelated and assumes personal liability. label Oct 27, 2022
@architkulkarni architkulkarni merged commit 694f57f into ray-project:master Oct 27, 2022
WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this pull request Dec 19, 2022
[Core] Stop iterating cancelled grpc request streams (#23865) (ray-project#27951)

This fixes the below grpc error mentioned in ray-project#23865.

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating requests!"
        debug_error_string = "None"
>
This error happens when proxying the grpc stream to the individual SpecificServer but the incoming grpc stream is already canceled. (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/_server.py#L353-L354)

Therefore, this change wraps the request_iterator in a new RequestIteratorProxy that catches the _CANCELLED signal.

Related issue number
ray-project#23865

Signed-off-by: Weichen Xu <[email protected]>
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.