[Core] Stop iterating cancelled grpc request streams (#23865) #27951
Conversation
Force-pushed from f3f645e to d0016f1.
Thanks for the PR, the fix looks reasonable--could you please add a test for this behavior?
Sure, I will add it tomorrow.
Force-pushed from d0016f1 to 24f6c20.
Hi @architkulkarni, sorry for the delay, I have updated the test case.
Sorry for the delay, but this looks LGTM!
@rueian I tried running through your script in #23865 (go proxy on localhost:8888) and wasn't able to reproduce the issue, so I might be missing some context. Does this change fix your original problem with the proxy, or does it just suppress the error message? Expected behavior would be that the script exits gracefully after printing out its result.
try:
    return next(self.request_iterator)
except grpc.RpcError:
    # stop iteration if the stream is CANCELLED
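For context, a minimal sketch of the wrapper this snippet belongs to, using the RequestIteratorProxy name from the PR description; this is an illustration under those assumptions, not the exact diff:

import grpc

class RequestIteratorProxy:
    """Wrap an incoming request stream so a cancelled stream ends iteration cleanly."""

    def __init__(self, request_iterator):
        self.request_iterator = request_iterator

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.request_iterator)
        except grpc.RpcError:
            # Stop iterating if the stream is CANCELLED instead of letting the
            # error bubble up as "Exception iterating requests!".
            raise StopIteration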
Can you add logger.exception("Iterator failed with the following exception:") here? That way the cause of the error can be logged somewhere instead of being silently discarded.
Hi @ckw017, instead of using logger.exception, I used logger.warning, since logger.exception is not very helpful in the case of grpc.RpcError. Please see the screenshot below.
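As a rough, hypothetical illustration of that choice (not the exact line in the PR), logger.warning can record the exception's repr without the mostly grpc-internal traceback that logger.exception would print:

import logging
import grpc

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

try:
    raise grpc.RpcError()  # stand-in for a failure raised by next(request_iterator)
except grpc.RpcError as e:
    # A bare RpcError carries no status details, so a traceback from
    # logger.exception adds little; a warning with the repr is enough to
    # record why iteration stopped.
    logger.warning("Grpc request stream failed, stop iterating: %r", e)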
Hi @ckw017, without this change, the script will still print out the expected result anyway, but not quite gracefully: there will also be a grpc._channel._MultiThreadedRendezvous error with details "Exception iterating requests!", and there will be extra noise in the logs. This change actually translates the cancelled request stream into a graceful end of iteration. I agree that we shouldn't silently discard the error, and since the bare grpc.RpcError is not very informative on its own, I log it with logger.warning instead.
Force-pushed from 24f6c20 to c2d21ce.
I agree that grpc.RpcError isn't that useful on its own, but the iterator will also catch anything that subclasses grpc.RpcError, which might have useful information. For example, it could catch this or this. Out of curiosity, does this get rid of the error message?
Force-pushed from 19df9d3 to e0e16b9.
Hi @ckw017, I have added it, but unfortunately the log was not really helpful:
DEBUG:ray:[ray] Forcing OMP_NUM_THREADS=1 to avoid performance degradation with many workers (issue #6998). You can override this by explicitly setting OMP_NUM_THREADS.
2022-09-02 21:24:37,115 INFO client_builder.py:247 -- Passing the following kwargs to ray.init() on the server: logging_level
2022-09-02 21:24:37,117 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.IDLE
2022-09-02 21:24:37,322 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2022-09-02 21:24:37,323 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2022-09-02 21:24:38,334 DEBUG worker.py:796 -- Pinging server.
Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in `ray_client_server_[port].out`
2022-09-02 21:24:38,337 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.CONNECTING
2022-09-02 21:24:38,338 DEBUG worker.py:378 -- client gRPC channel state change: ChannelConnectivity.READY
2022-09-02 21:24:38,948 DEBUG worker.py:629 -- Retaining 00b119b77d8c0eba44c665e31bc59911cf7f9a0d0100000001000000
2022-09-02 21:24:38,948 DEBUG worker.py:553 -- Scheduling task get_dashboard_url 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x01\x00\x00\x00'
2022-09-02 21:24:38,950 DEBUG worker.py:629 -- Retaining a0972fac8691ffaaffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:39,292 DEBUG worker.py:625 -- Releasing a0972fac8691ffaaffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,327 DEBUG worker.py:629 -- Retaining 00b119b77d8c0eba44c665e31bc59911cf7f9a0d0100000002000000
2022-09-02 21:24:42,352 DEBUG worker.py:629 -- Retaining 00b119b77d8c0eba44c665e31bc59911cf7f9a0d0100000003000000
2022-09-02 21:24:42,352 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,352 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,352 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,352 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,353 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,353 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,353 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,354 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,354 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,354 DEBUG worker.py:553 -- Scheduling task length 0 b'\x00\xb1\x19\xb7}\x8c\x0e\xbaD\xc6e\xe3\x1b\xc5\x99\x11\xcf\x7f\x9a\r\x01\x00\x00\x00\x03\x00\x00\x00'
2022-09-02 21:24:42,355 DEBUG worker.py:629 -- Retaining d9811d3c1d420ce1ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,356 DEBUG worker.py:629 -- Retaining bcacbd0d9fe6d0caffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,356 DEBUG worker.py:629 -- Retaining 0296b6a385a0ebabffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,356 DEBUG worker.py:629 -- Retaining ab5085433c874bffffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357 DEBUG worker.py:629 -- Retaining bf43bb834add5fbeffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357 DEBUG worker.py:629 -- Retaining d7465eca34a71d1bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357 DEBUG worker.py:629 -- Retaining 7f4d261ddb36ab9bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,357 DEBUG worker.py:629 -- Retaining 8d9e2c5883d6d8f9ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,358 DEBUG worker.py:629 -- Retaining a3044293ddc3bb69ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,358 DEBUG worker.py:629 -- Retaining e1f20a4cf4920a9affffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing e1f20a4cf4920a9affffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing a3044293ddc3bb69ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing 8d9e2c5883d6d8f9ffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing 7f4d261ddb36ab9bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing d7465eca34a71d1bffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing bf43bb834add5fbeffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing ab5085433c874bffffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing 0296b6a385a0ebabffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing bcacbd0d9fe6d0caffffffffffffffffffffffff0100000001000000
2022-09-02 21:24:42,866 DEBUG worker.py:625 -- Releasing d9811d3c1d420ce1ffffffffffffffffffffffff0100000001000000
[10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000, 10000000]
2022-09-02 21:24:43,247 DEBUG dataclient.py:277 -- Shutting down data channel.
You are right. I have added a new type check and a test case to make sure that the new change does not catch those sub-exceptions.
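A minimal sketch of what such a type check could look like, assuming only the bare grpc.RpcError should end iteration while subclasses keep propagating; the helper name and exact check are assumptions, not the PR's diff:

import logging
import grpc

logger = logging.getLogger(__name__)

def _next_request(request_iterator):
    try:
        return next(request_iterator)
    except grpc.RpcError as e:
        if type(e) is not grpc.RpcError:
            # Subclasses (e.g. a rendezvous object carrying a status code and
            # details) may hold useful information, so let them propagate.
            raise
        # A bare grpc.RpcError here means the incoming stream was cancelled.
        logger.warning("Grpc request stream is cancelled, stop iterating: %r", e)
        raise StopIteration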
The server-side logs are mostly there for debugging (users usually won't check them unless they're trying to find what broke). I'm still wondering why we would avoid logging the exception that occurred. Even if it's a generic RpcError, logging the exception will still include stack trace information that shows which specific codepath broke.
Overall, I'm not sure that ignoring the error and gracefully terminating when the iterator broke because of a grpc error is a good idea; it's essentially just burying the issue. For example, if the …
Force-pushed from e0e16b9 to 94d75ad.
This looks good now, thanks! If you can merge master and re-run, we can get this merged.
Force-pushed from bedba55 to 38a2a4c.
Hi @ckw017, thank you for your help. I have rebased the commit and all non-pending tests have passed now.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.
…3865) Signed-off-by: Rueian <[email protected]>
Force-pushed from 38a2a4c to 39cf221.
Not stale. Hi @ckw017, sorry to bother you, but could you review this again?
@ckw017 @architkulkarni @ijrsvt Could you help follow up here? Thanks for helping our new contributor so far.
The failed tests look unrelated and were flaky on master in the past. Some failed with "failure to upload artifact", so I've restarted those to be safe. I'll merge this PR once they pass.
Doc failures (no module starlette) are unrelated and were already broken on master.
…3865) (ray-project#27951)
This fixes the below grpc error mentioned in ray-project#23865.
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Exception iterating requests!"
    debug_error_string = "None"
>
This error happens when proxying the grpc stream to the individual SpecificServer but the incoming grpc stream is already canceled. (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/_server.py#L353-L354)
Therefore, just wrap the request_iterator with a new RequestIteratorProxy which will catch the _CANCELLED signal.
Related issue number
ray-project#23865
Signed-off-by: Weichen Xu <[email protected]>
Why are these changes needed?
This fixes the below grpc error mentioned in #23865:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
    status = StatusCode.UNKNOWN
    details = "Exception iterating requests!"
    debug_error_string = "None"
>
This error happens when proxying the grpc stream to the individual SpecificServer but the incoming grpc stream is already canceled. (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/_server.py#L353-L354)
Therefore, just wrap the request_iterator with a new RequestIteratorProxy which will catch the _CANCELLED signal.
Related issue number
#23865
Checks
I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.