
[Ray Client] Unable to connect to client when utilizing a proxy #23865

Closed
peterhaddad3121 opened this issue Apr 12, 2022 · 15 comments
Labels
bug: Something that is supposed to be working, but isn't
core-client: Ray Client related issues
triage: Needs triage (e.g. priority, bug/not-bug, and owning component)

Comments

@peterhaddad3121

peterhaddad3121 commented Apr 12, 2022

What happened + What you expected to happen

When using a Ray cluster with the pod's proxy environment variables set (i.e. the gRPC proxy and HTTP proxy variables), I am unable to connect to the remote cluster using ray.init(). This only occurs when the proxies are set, and I have verified that my no_proxy environment variables are set correctly.

Are there specific configurations that need to be set when running behind a proxy? I am rebuilding the image with those environment variables baked in. Essentially, I am trying to use runtime environments to install pip packages; however, when the proxy settings are set, the gRPC client is unable to connect to the Ray Client server. A sketch of the setup is below.
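For reference, this is roughly what the setup looks like (a sketch; the proxy address and head node address are placeholders, not my real values):

import os

import ray

# Proxy variables baked into the pod image (placeholder values).
os.environ["http_proxy"] = "http://proxy.example.com:3128"
os.environ["https_proxy"] = "http://proxy.example.com:3128"
os.environ["grpc_proxy"] = "http://proxy.example.com:3128"
# Cluster-internal addresses are excluded from proxying.
os.environ["no_proxy"] = "localhost,127.0.0.1,<head-node-address>"

# Connecting to the remote cluster with a runtime environment
# that installs pip packages.
ray.init(
    "ray://<head-node-address>:10001",
    runtime_env={"pip": ["torch"]},
)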

These are the specific errors I receive:

INFO:ray.util.client.server.server:Starting Ray Client server on 0.0.0.0:10001
INFO:ray.util.client.server.proxier:New data connection from client 0d8a49073b31478a9518ecc41101c230: 
INFO:ray.util.client.server.proxier:SpecificServer started on port: 23000 with PID: 298 for client: 0d8a49073b31478a9518ecc41101c230
ERROR:ray.util.client.server.proxier:Timeout waiting for channel for 0d8a49073b31478a9518ecc41101c230
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 352, in get_channel
    server.channel).result(timeout=CHECK_CHANNEL_TIMEOUT_S)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
ERROR:ray.util.client.server.proxier:Timeout waiting for channel for 0d8a49073b31478a9518ecc41101c230
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 352, in get_channel
    server.channel).result(timeout=CHECK_CHANNEL_TIMEOUT_S)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
WARNING:ray.util.client.server.proxier:Retrying Logstream connection. 1 attempts failed.
ERROR:ray.util.client.server.proxier:Channel not found for 0d8a49073b31478a9518ecc41101c230

I was able to work around this issue by manually setting the pip environment variables to use the proxy while the gRPC server does not (a sketch follows the log excerpt below). I do not believe this is the best workaround, though, and I still receive some proxy errors in the head node's logs:

ERROR:ray.util.client.server.proxier:Proxying Datapath failed!
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/util/client/server/proxier.py", line 659, in Datapath
    for resp in resp_stream:
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating requests!"
        debug_error_string = "None"
>
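Roughly, the workaround looks like the following (a sketch with placeholder addresses; in practice the variables are set in the pod image rather than in the driver script). pip reads options from PIP_<OPTION> environment variables, so PIP_PROXY maps to pip's --proxy flag:

import os

import ray

# Clear the process-wide proxy variables so gRPC connects directly.
for var in ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY", "grpc_proxy"):
    os.environ.pop(var, None)

# Route only pip through the proxy (placeholder address).
os.environ["PIP_PROXY"] = "http://proxy.example.com:3128"

ray.init("ray://<head-node-address>:10001", runtime_env={"pip": ["torch"]})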

Additionally, I have noticed issues when a worker node tries to install the pip packages: the Ray head node is able to install them, but the worker nodes are not. Any insight would be appreciated.

(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,896 E 127 127] (raylet) agent_manager.cc:190: Failed to create runtime env: {"envVars": {"PIP_NO_CACHE_DIR": "1"}, "extensions": {"_ray_commit": "fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"}, "pythonRuntimeEnv": {"pipRuntimeEnv": {"config": {"packages": ["torch"]}}}, "uris": {"pipUri": "pip://9c0e795652de796e136d31847c5f97c28b7d09d7"}}, error message: Failed to install pip requirements:
(raylet, ip=10.131.4.120) Collecting torch
(raylet, ip=10.131.4.120) Downloading torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
(raylet, ip=10.131.4.120) 
(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,897 E 127 127] (raylet) worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 01000000.
(raylet, ip=10.131.4.120) [2022-04-12 12:29:46,897 E 127 127] (raylet) agent_manager.cc:190: Failed to create runtime env: {"envVars": {"PIP_NO_CACHE_DIR": "1"}, "extensions": {"_ray_commit": "fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"}, "pythonRuntimeEnv": {"pipRuntimeEnv": {"config": {"packages": ["torch"]}}}, "uris": {"pipUri": "pip://9c0e795652de796e136d31847c5f97c28b7d09d7"}}, error message: Failed to install pip requirements:
(raylet, ip=10.131.4.120) Collecting torch
(raylet, ip=10.131.4.120) Downloading torch-1.11.0-cp37-cp37m-manylinux1_x86_64.whl (750.6 MB)
(raylet, ip=10.131.4.120) 
(pid=gcs_server) [2022-04-12 12:29:46,897 E 126 126] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 078ddb83b3aa89732f7ed6578e448bc4f1fadf340e14edbdbbc251d5 for actor 24c732320cea8daebcbfbce701000000(PPOTrainer.__init__) has been canceled, job id = 01000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

Versions / Dependencies

Ray v1.11.0
Python 3.7.12

Reproduction script

Using ray.init in a Jupyter Notebook

@peterhaddad3121 added the bug and triage labels Apr 12, 2022
@peterhaddad3121 changed the title from "Ray Core - Unable to connect to client when utilizing a proxy" to "Ray Core & Ray Tune - Unable to connect to client when utilizing a proxy" Apr 12, 2022
@architkulkarni
Contributor

Does it work if you install a different package (e.g. pip-install-test) instead of torch? It might be an issue with pip running out of memory when installing torch; as you can see in the traceback, the pip install process halts in the middle of installing torch.

@peterhaddad3121
Author

peterhaddad3121 commented Apr 12, 2022

Hi @architkulkarni,
Thanks for getting back to me!

I've increased the memory of the nodes to 4Gi.
Also, this is what my runtime_env looks like: runtime_env={"pip": ["pip-install-test"]}

Going into the worker environment logs, it does seem that pip-install-test gets installed; however, I am still seeing errors during initialization of the environment:

Going into the terminal of the worker node:

cat raylet.err

[2022-04-12 15:21:52,520 E 127 127] (raylet) agent_manager.cc:190: Failed to create runtime env: {"extensions": {"_ray_commit": "fec30a25dbb5f3fa81d2bf419f75f5d40bc9fc39"}, "pythonRuntimeEnv": {"pipRuntimeEnv": {"config": {"packages": ["pip-install-test"]}}}, "uris": {"pipUri": "pip://c39c09b8efb94733a272ea012096c7cc7044dd36"}}, error message: [Errno 2] No such file or directory: '/tmp/ray/session_2022-04-12_15-17-29_080471_1450/runtime_resources/pip/c39c09b8efb94733a272ea012096c7cc7044dd36'
[2022-04-12 15:21:52,520 E 127 127] (raylet) worker_pool.cc:623: [Eagerly] Couldn't create a runtime environment for job 02000000.

Error in ray-base-ray-head (head node)

Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/optional_utils.py", line 86, in _handler_route
    return await handler(bind_info.instance, req)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/dashboard/modules/node/node_head.py", line 219, in get_errors
    ip = req.query["ip"]
KeyError: 'ip'

Additionally, here is the trace from the notebook:

(pid=runtime_env, ip=10.131.4.197) 2022-04-12 15:21:49,761	INFO conda_utils.py:198 -- ERROR: Could not open requirements file: [Errno 2] No such file or directory: '/tmp/ray/session_2022-04-12_15-17-29_080471_1450/runtime_resources/pip/c39c09b8efb94733a272ea012096c7cc7044dd36/requirements.txt'
(pid=runtime_env, ip=10.131.4.197) 2022-04-12 15:21:50,020	INFO conda_utils.py:198 -- 
(pid=runtime_env, ip=10.131.4.197) 2022-04-12 15:21:51,507	INFO conda_utils.py:198 -- Collecting pip-install-test
(pid=runtime_env, ip=10.131.4.197) 2022-04-12 15:21:51,685	INFO conda_utils.py:198 -- Downloading pip_install_test-0.5-py3-none-any.whl (1.7 kB)
(pid=runtime_env, ip=10.131.4.197) 2022-04-12 15:21:52,401	INFO conda_utils.py:198 -- Installing collected packages: pip-install-test
(pid=runtime_env, ip=10.131.4.197) 2022-04-12 15:21:52,408	INFO conda_utils.py:198 -- Successfully installed pip-install-test-0.5
(pid=runtime_env, ip=10.131.4.197) 2022-04-12 15:21:52,518	INFO conda_utils.py:198 -- 
(run pid=1834) 2022-04-12 15:21:52,522	ERROR trial_runner.py:920 -- Trial PPOTrainer_CoinGame_97c9a_00000: Error processing event.
(run pid=1834) Traceback (most recent call last):
(run pid=1834)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 886, in _process_trial
(run pid=1834)     results = self.trial_executor.fetch_result(trial)
(run pid=1834)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/tune/ray_trial_executor.py", line 675, in fetch_result
(run pid=1834)     result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
(run pid=1834)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
(run pid=1834)     return func(*args, **kwargs)
(run pid=1834)   File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/worker.py", line 1765, in get
(run pid=1834)     raise value
(run pid=1834) ray.exceptions.RuntimeEnvSetupError: The runtime_env failed to be set up.
(pid=gcs_server) [2022-04-12 15:21:52,521 E 1454 1454] (gcs_server) gcs_actor_scheduler.cc:320: The lease worker request from node 1ab5fdc1974a84a6fa2c44634c011d01442f3c192c372ad69fd9a652 for actor 9ff9b3de2da70ffb6322c93b02000000(PPOTrainer.__init__) has been canceled, job id = 02000000, cancel type: SCHEDULING_CANCELLED_RUNTIME_ENV_SETUP_FAILED

It does seem that the packages are being installed, but the environment is unable to initialize?

@architkulkarni
Contributor

Thanks for the details. From the missing-file error, it looks like it might be a bug in runtime environments where some necessary files are getting deleted too early.

What's interesting is that this only occurs when proxies are set; I'm not sure yet how that could be related. I'd be curious to know whether the issue persists on ray==1.12.0rc1 or on the nightly wheels.

What would be the simplest way to reproduce this? Also, does it happen every time, or randomly?

@peterhaddad3121
Author

No problem! Happy to help identify these issues!

I actually think it's inconsistent! I got further this time; however, the job still expects PyTorch. I am baking it into the image to see if that succeeds.

This is the specific source code I am testing against; it was an example from the community:

##########
# Contribution by the Center on Long-Term Risk:
# https://github.com/longtermrisk/marltoolbox
##########
import os

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.examples.env.coin_game_non_vectorized_env import CoinGame, AsymCoinGame

def main(debug, stop_iters=2000, tf=False, asymmetric_env=False):
    train_n_replicates = 1 if debug else 1
    seeds = list(range(train_n_replicates))

    stop = {
        "training_iteration": 2 if debug else stop_iters,
    }

    env_config = {
        "players_ids": ["player_red", "player_blue"],
        "max_steps": 20,
        "grid_size": 3,
        "get_additional_info": True,
    }

    rllib_config = {
        "env": AsymCoinGame if asymmetric_env else CoinGame,
        "env_config": env_config,
        "multiagent": {
            "policies": {
                env_config["players_ids"][0]: (
                    None,
                    AsymCoinGame(env_config).OBSERVATION_SPACE,
                    AsymCoinGame.ACTION_SPACE,
                    {},
                ),
                env_config["players_ids"][1]: (
                    None,
                    AsymCoinGame(env_config).OBSERVATION_SPACE,
                    AsymCoinGame.ACTION_SPACE,
                    {},
                ),
            },
            "policy_mapping_fn": lambda agent_id, **kwargs: agent_id,
        },
        # Size of batches collected from each worker.
        "rollout_fragment_length": 20,
        # Number of timesteps collected for each SGD round.
        # This defines the size of each SGD epoch.
        "train_batch_size": 512,
        "model": {
            "dim": env_config["grid_size"],
            "conv_filters": [
                [16, [3, 3], 1],
                [32, [3, 3], 1],
            ],  # [Channel, [Kernel, Kernel], Stride]]
        },
        "lr": 5e-3,
        "seed": tune.grid_search(seeds),
        "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
        "framework": "tf" if tf else "torch",
    }

    tune_analysis = tune.run(
        PPOTrainer,
        config=rllib_config,
        stop=stop,
        checkpoint_freq=0,
        checkpoint_at_end=False,
        name="PPO_AsymCG",
    )
    ray.shutdown()
    return tune_analysis


if __name__ == "__main__":
    debug_mode = True
    use_asymmetric_env = False
    # Pass asymmetric_env by keyword; passed positionally it would
    # bind to stop_iters.
    main(debug_mode, asymmetric_env=use_asymmetric_env)

This is still pretty inconsistent, though; it seems that the connection sometimes drops. Curious whether this is because of the timeout? I will give 1.12.0rc1 a try to see if I get better results.

@peterhaddad3121
Author

To provide additional context, this seems to be an issue with any large package. I increased the memory of the head node to 16Gi, but I still receive timeout errors and repeated package installs.

@amogkam changed the title from "Ray Core & Ray Tune - Unable to connect to client when utilizing a proxy" to "Ray Core - Unable to connect to client when utilizing a proxy" Apr 18, 2022
@clarkzinzow changed the title from "Ray Core - Unable to connect to client when utilizing a proxy" to "[Core] Unable to connect to client when utilizing a proxy" Apr 22, 2022
@clarkzinzow changed the title from "[Core] Unable to connect to client when utilizing a proxy" to "[Ray Client] Unable to connect to client when utilizing a proxy" Apr 22, 2022
@architkulkarni
Contributor

architkulkarni commented Apr 22, 2022

Hi @peterhaddad3121, we've tried to reproduce this on a multinode cluster with the following:

import ray
import os
import time

ray.init(
    "ray://<cluster>",
    runtime_env={
        "pip": ["torch"],
    })


@ray.remote
def f():
    return 1


print(ray.get([f.remote() for _ in range(20)]))

We ran it multiple times, but we couldn't reproduce the issue. Do you have any more hints about how to make our setup closer to yours? Also, did the problem persist with Ray 1.12.0?

@peterhaddad3121
Author

Hey @architkulkarni, thank you for looking into this!

I upgraded to 1.12.0 and it looks like we are having success on the head node when installing torch, but not on the workers.

These are the errors I am seeing on the workers; both use the same image.

2022-04-25 07:43:22,608 INFO pip.py:342 -- Delete incomplete virtualenv: /tmp/ray/session_2022-04-25_07-33-59_503456_466/runtime_resources/pip/6aa4b4555672b194ebb27178e50d10b97086d35a
2022-04-25 07:43:22,609 ERROR pip.py:344 -- Failed to install pip packages.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.6/site-packages/ray/_private/runtime_env/pip.py", line 331, in _run
    logger,
  File "/home/ray/anaconda3/lib/python3.6/site-packages/ray/_private/runtime_env/pip.py", line 302, in _install_pip_packages
    await check_output_cmd(pip_install_cmd, logger=logger, cwd=cwd, env=pip_env)
  File "/home/ray/anaconda3/lib/python3.6/site-packages/ray/_private/runtime_env/utils.py", line 102, in check_output_cmd
    proc.returncode, cmd, output=stdout, cmd_index=cmd_index
ray._private.runtime_env.utils.SubprocessCalledProcessError: Run cmd[11] failed with the following details.
Command '['/tmp/ray/session_2022-04-25_07-33-59_503456_466/runtime_resources/pip/6aa4b4555672b194ebb27178e50d10b97086d35a/virtualenv/bin/python', '-m', 'pip', 'install', '--disable-pip-version-check', '--no-cache-dir', '-r', '/tmp/ray/session_2022-04-25_07-33-59_503456_466/runtime_resources/pip/6aa4b4555672b194ebb27178e50d10b97086d35a/requirements.txt']' returned non-zero exit status 1.
Last 50 lines of stdout:
    Collecting torch
      Downloading torch-1.10.2-cp36-cp36m-manylinux1_x86_64.whl (881.9 MB)
    ERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory

@edoakes removed the platform label Apr 25, 2022
@architkulkarni
Contributor

I'm not sure what's causing the failure here; it's just running a pip install on the worker nodes. If you manually run pip install torch on the worker node, does it fail with the same error? Just brainstorming here.
@Catch-Bull have you ever seen this error before?

@peterhaddad3121
Author

peterhaddad3121 commented Apr 25, 2022

Hi @architkulkarni, I was able to get past this issue by setting the proxies for pip and the image, i.e. http_proxy, https_proxy, and no_proxy. However, doing this seems to break the Ray Client (port 10001) when using an interactive shell.

I believe I can work past this by unsetting the proxies and only setting the pip_config proxy; however, are there recommended workarounds?

ERROR proxier.py:371 -- Timeout waiting for channel for 7b236c5f655842c9a85bed6263f00b96
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.6/site-packages/ray/util/client/server/proxier.py", line 367, in get_channel
    timeout=CHECK_CHANNEL_TIMEOUT_S
  File "/home/ray/anaconda3/lib/python3.6/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.6/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-04-25 16:05:05,362	ERROR proxier.py:371 -- Timeout waiting for channel for 7b236c5f655842c9a85bed6263f00b96
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.6/site-packages/ray/util/client/server/proxier.py", line 367, in get_channel
    timeout=CHECK_CHANNEL_TIMEOUT_S
  File "/home/ray/anaconda3/lib/python3.6/site-packages/grpc/_utilities.py", line 139, in result
    self._block(timeout)
  File "/home/ray/anaconda3/lib/python3.6/site-packages/grpc/_utilities.py", line 85, in _block
    raise grpc.FutureTimeoutError()
grpc.FutureTimeoutError
2022-04-25 16:05:05,363	WARNING proxier.py:749 -- Retrying Logstream connection. 1 attempts failed.
2022-04-25 16:05:05,363	ERROR proxier.py:664 -- Channel not found for 7b236c5f655842c9a85bed6263f00b96

It looks like gRPC has its own proxy-exclusion environment variable (no_grpc_proxy); however, I am unsure whether the Ray community recommends a particular approach here: https://grpc.github.io/grpc/cpp/md_doc_environment_variables.html. A sketch of what I have in mind is below.
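For example, something along these lines (a sketch I have not verified, with placeholder addresses), set before the client creates its gRPC channel:

import os

import ray

# Per the gRPC environment-variable docs, grpc_proxy sets the proxy
# used for gRPC traffic, and no_grpc_proxy / no_proxy are
# comma-separated exclusion lists that override it. Excluding the
# cluster addresses should keep client <-> head gRPC traffic off the
# proxy while pip and other HTTP traffic still go through it.
os.environ["grpc_proxy"] = "http://proxy.example.com:3128"  # placeholder
os.environ["no_grpc_proxy"] = "localhost,127.0.0.1,<head-node-address>"

ray.init("ray://<head-node-address>:10001")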

@rueian
Contributor

rueian commented Aug 15, 2022

I got the same grpc "Exception iterating requests!" error when connecting to a Ray Head behind a golang grpc proxy server.

I am using Ray 1.13.0

[Screenshot: ray_client_server.err showing the "Exception iterating requests!" traceback]

@architkulkarni
Contributor

Hi @rueian, sorry you're running into that. Do you have a way of reproducing this error? What code are you running?

@rueian
Contributor

rueian commented Aug 15, 2022

Hi @architkulkarni,

I am running this:

import ray
ray.init(address='ray://127.0.0.1:8888')

@ray.remote
def length(ds):
    return len(ds)

ref = ray.put([1 for i in range(10000000)])
print(ray.get([length.remote(ref) for _ in range(10)]))

to my local ray head:

▶ ray --version
ray, version 1.13.0
▶ ray start --head --ray-client-server-port=10001
Local node IP: 127.0.0.1
2022-08-16 00:06:13,452	INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265

through a golang grpc proxy server:

// run this code with 'go run main.go -listen 127.0.0.1:8888 -target 127.0.0.1:10001'
package main

import (
	"flag"
	"github.com/mwitkow/grpc-proxy/proxy"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
	"log"
	"math"
	"net"
)

func main() {
	bind := flag.String("listen", "0.0.0.0:8888", "listen address, default: 0.0.0.0:8888")
	addr := flag.String("target", "127.0.0.1:10001", "the proxy target address, ex: my.ray.server:10001")
	flag.Parse()

	ln, err := net.Listen("tcp", *bind)
	if err != nil {
		log.Fatalf("fail to listen on %s: %v\n", *bind, err)
	}

	cc, err := grpc.Dial(*addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("fail to dial on %s: %v\n", *addr, err)
	}

	server := proxy.NewProxy(cc,
		grpc.MaxConcurrentStreams(math.MaxUint32),
		grpc.MaxRecvMsgSize(math.MaxUint32),
		grpc.MaxSendMsgSize(math.MaxUint32),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    1000 * 30,
			Timeout: 1000 * 600,
		}))

	log.Printf("server listens on %s to proxy %s\n", *bind, *addr)
	if err := server.Serve(ln); err != nil {
		log.Fatalf("server exit with err: %v\n", err)
	}
}

Then the driver prints "Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in ray_client_server_[port].out", and the logs/ray_client_server.err file contains the error shown in the screenshot above.

Sorry that the example is in Go, but this is the only proxy I have at hand.

@architkulkarni
Contributor

Thanks for the details! This should be enough for us to try to reproduce the error. @ckw017 do you happen to know anything about the proxy issue @rueian is facing?

@rueian
Contributor

rueian commented Aug 17, 2022

Hi @architkulkarni,

I have found the cause and created a PR for the "Exception iterating requests!" error.

#27951

rueian added a commit to rueian/ray that referenced this issue Oct 12, 2022
@zhe-thoughts added the core-client label Oct 26, 2022
architkulkarni pushed a commit that referenced this issue Oct 27, 2022
This fixes the grpc error below, mentioned in #23865:

grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
        status = StatusCode.UNKNOWN
        details = "Exception iterating requests!"
        debug_error_string = "None"
>
This error happens when proxying the grpc stream to the individual SpecificServer while the incoming grpc stream has already been canceled (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/_server.py#L353-L354).

Therefore, we wrap the request_iterator in a new RequestIteratorProxy that catches the _CANCELLED signal.
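Conceptually, the wrapper looks something like this (a minimal sketch of the idea, not the exact code merged in the PR):

import grpc


class RequestIteratorProxy:
    """Minimal sketch: wrap the incoming request iterator so a stream
    that was already canceled ends cleanly instead of surfacing as an
    "Exception iterating requests!" error in the forwarding call.
    """

    def __init__(self, request_iterator):
        self._request_iterator = request_iterator

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._request_iterator)
        except grpc.RpcError:
            # The incoming stream was torn down (e.g. canceled by the
            # client); signal a normal end-of-stream to the outgoing call.
            raise StopIteration

The proxier can then pass RequestIteratorProxy(request_iterator) to the SpecificServer stub instead of the raw iterator.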

Related issue number
#23865
@zhe-thoughts
Collaborator

Closing this issue as the root cause has been found and the PR has been merged.

WeichenXu123 pushed a commit to WeichenXu123/ray that referenced this issue Dec 19, 2022