[Ray Client] Unable to connect to client when utilizing a proxy #23865
Comments
Does it work if you install a different package (e.g. |
Hi @architkulkarni, I've increased the memory of the nodes to 4Gi. Going into the worker environment logs it does seem that pip-install-test does get installed; however, I am still seeing errors during initialization of the environment: Going into the terminal of the worker node: cat raylet.err
Error in ray-base-ray-head (head node)
Additionally, here is the trace from the notebook:
It does seem that the packages are being installed, but the environment seems unable to initialize? |
Thanks for the details--from the missing file error it looks like it might be a bug in runtime environments where some necessary files are getting deleted too early. What's interesting is that this only occurs when proxies are set--I'm not sure yet how that could be related. I'd be curious to know if the issue still persists on What would be the simplest way to reproduce this? Also, does it happen every time, or randomly? |
No problem! I want to help identify these opportunities! I actually think it's inconsistent! I got further this time; however, it is still expecting PyTorch in my specific job. I am baking it into the image to see if I can get success. This is the specific source code I am testing against. This was an example in the community:
This is still pretty inconsistent though, it seems that the connection sometimes drops. Curious if this is because of the timeout? I will give 1.12.0rc1 a try to see if I get better results. |
To provide additional context, this seems to be an issue with any large package. I increased the memory to 16Gi for the head node, but I still receive timeout errors and repeated attempts to install packages. |
Hi @peterhaddad3121, we've tried to reproduce this on a multinode cluster with the following:
running it multiple times, but we couldn't reproduce it. Do you have any more hints about how to make our setup closer to yours? Also, did the problem persist with Ray 1.12.0? |
Hey @architkulkarni, thank you for looking into this! I upgraded to 1.12.0 and it looks like we are having success on the head node when installing torch, but not on the workers. These are the errors I am seeing on the workers; both use the same image.
|
I'm not sure what's causing the failure here--it's just running a |
Hi @architkulkarni, I was able to get past this issue by setting the proxies for pip and for the image, i.e. http_proxy, https_proxy, and no_proxy. However, with these set, the Ray Client (i.e. port 10001) breaks when used from an interactive shell. I believe I can work around this by unsetting the proxies and setting only the pip_config proxy; however, are there other workarounds?
It looks like there is a gRPC no-proxy variable (no_grpc_proxy); however, I am unsure whether the Ray community has recommendations on the best approach. https://grpc.github.io/grpc/cpp/md_doc_environment_variables.html |
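The workaround described above (keep the proxy for pip, drop it for the gRPC client) can be sketched in Python as follows. This is only an illustration under assumptions: `client_env` and the proxy URL are hypothetical names, not part of Ray, and whether Ray Client honors these variables in a given setup is not confirmed by this thread. `PIP_PROXY` is pip's environment-variable form of its `--proxy` option.

```python
import os

# Proxy variables that a gRPC or HTTP client may read (lower-case names;
# the upper-case variants are removed as well).
PROXY_VARS = ("grpc_proxy", "http_proxy", "https_proxy")


def client_env(base_env, pip_proxy=None):
    """Return a copy of base_env with the general proxy variables removed.

    If pip_proxy is given, it is kept only under PIP_PROXY, so pip can
    still use the proxy while other clients connect directly.
    This helper is hypothetical and not part of Ray.
    """
    env = dict(base_env)
    for name in PROXY_VARS:
        env.pop(name, None)
        env.pop(name.upper(), None)
    if pip_proxy:
        env["PIP_PROXY"] = pip_proxy
    return env


if __name__ == "__main__":
    # The proxy URL below is a placeholder.
    env = client_env(os.environ, pip_proxy="http://proxy.example.com:3128")
    print(sorted(k for k in env if "proxy" in k.lower()))
```

The resulting mapping could be passed as `env=` to `subprocess.run` when launching the driver, which avoids mutating the environment of the current process.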
Hi @rueian, sorry you're running into that. Do you have a way of reproducing this error? What code are you running? |
Hi @architkulkarni, I am running this:
import ray
ray.init(address='ray://127.0.0.1:8888')
@ray.remote
def length(ds):
    return len(ds)
ref = ray.put([1 for i in range(10000000)])
print(ray.get([length.remote(ref) for _ in range(10)]))
to my local ray head:
▶ ray --version
ray, version 1.13.0
▶ ray start --head --ray-client-server-port=10001
Local node IP: 127.0.0.1
2022-08-16 00:06:13,452 INFO services.py:1470 -- View the Ray dashboard at http://127.0.0.1:8265
through a golang grpc proxy server:
// run this code with 'go run main.go -listen 127.0.0.1:8888 -target 127.0.0.1:10001'
package main
import (
	"flag"
	"github.com/mwitkow/grpc-proxy/proxy"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
	"google.golang.org/grpc/keepalive"
	"log"
	"math"
	"net"
)
func main() {
	bind := flag.String("listen", "0.0.0.0:8888", "listen address, default: 0.0.0.0:8888")
	addr := flag.String("target", "127.0.0.1:10001", "the proxy target address, ex: my.ray.server:10001")
	flag.Parse()
	ln, err := net.Listen("tcp", *bind)
	if err != nil {
		log.Fatalf("fail to listen on %s: %v\n", *bind, err)
	}
	cc, err := grpc.Dial(*addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
	if err != nil {
		log.Fatalf("fail to dial on %s: %v\n", *addr, err)
	}
	server := proxy.NewProxy(cc,
		grpc.MaxConcurrentStreams(math.MaxUint32),
		grpc.MaxRecvMsgSize(math.MaxUint32),
		grpc.MaxSendMsgSize(math.MaxUint32),
		grpc.KeepaliveParams(keepalive.ServerParameters{
			Time:    1000 * 30,
			Timeout: 1000 * 600,
		}))
	log.Printf("server listens on %s to proxy %s\n", *bind, *addr)
	if err := server.Serve(ln); err != nil {
		log.Fatalf("server exit with err: %v\n", err)
	}
}
Then the driver will print out "Log channel is reconnecting. Logs produced while the connection was down can be found on the head node of the cluster in
Sorry for the inconvenience to golang, but this is the only proxy I have now. |
Hi @architkulkarni, I have found the cause and have created a PR for the "Exception iterating requests!" error. |
…3865) Signed-off-by: Rueian <[email protected]>
This fixes the below grpc error mentioned in #23865. grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Exception iterating requests!" debug_error_string = "None" > This error happens when proxying the grpc stream to the individual SpecificServer but the incoming grpc stream is already canceled. (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/_server.py#L353-L354) Therefore, just wrap the request_iterator with a new RequestIteratorProxy which will catch the _CANCELLED signal. Related issue number #23865
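The idea in the PR description, wrapping the incoming request iterator so that a cancelled stream ends iteration cleanly instead of surfacing as "Exception iterating requests!", can be sketched roughly like this. It is a simplified illustration, not the merged Ray code: `StreamCancelled` is a hypothetical stand-in for the error a cancelled gRPC stream raises, and `cancelled_stream` only simulates such a stream.

```python
class StreamCancelled(Exception):
    """Hypothetical stand-in for the error raised by a cancelled stream."""


class RequestIteratorProxy:
    """Wrap a request iterator and end iteration quietly on cancellation."""

    def __init__(self, request_iterator):
        self._it = iter(request_iterator)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._it)
        except StreamCancelled:
            # Treat a cancelled upstream as a normal end of the stream so
            # the code forwarding requests downstream stops without error.
            raise StopIteration from None


def cancelled_stream():
    """Simulated client stream that is cancelled after two requests."""
    yield "req-1"
    yield "req-2"
    raise StreamCancelled()


if __name__ == "__main__":
    print(list(RequestIteratorProxy(cancelled_stream())))
```

Any other exception type still propagates, so genuine failures in the stream are not hidden by the wrapper.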
Closing this issue as the root cause has been found and PR merged |
…3865) (ray-project#27951) This fixes the below grpc error mentioned in ray-project#23865. grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.UNKNOWN details = "Exception iterating requests!" debug_error_string = "None" > This error happens when proxying the grpc stream to the individual SpecificServer but the incoming grpc stream is already canceled. (https://github.com/grpc/grpc/blob/v1.43.0/src/python/grpcio/grpc/_server.py#L353-L354) Therefore, just wrap the request_iterator with a new RequestIteratorProxy which will catch the _CANCELLED signal. Related issue number ray-project#23865 Signed-off-by: Weichen Xu <[email protected]>
What happened + What you expected to happen
When using a Ray cluster with the pod's proxy environment variables set (i.e. the gRPC proxy and HTTP proxy variables), I am unable to connect to the remote cluster using ray.init(). This only occurs when proxies are set, and I have made sure that my no_proxy environment variables are set properly.
Are there specific configurations that need to be set when running behind a proxy? I am rebuilding the image with those environment variables specified. Essentially, I am trying to use runtime environments (runtime_env) to install pip packages; however, when the proxy settings are set, the gRPC client is unable to connect to the Ray Client server.
These are the specific errors I receive:
I was able to work around this issue by manually setting the pip environment variables to use the proxy while leaving the gRPC server without one. I do not believe this is the best workaround, though, and I still receive some proxy errors in the head node's logs:
Additionally, I have noticed issues when a worker node tries to install the pip packages: the Ray head node is able to install them, but the worker nodes are not. Any insight would be appreciated.
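When checking that no_proxy covers the cluster addresses, a quick sanity test with Python's standard library may help. The sketch below uses `urllib.request.proxy_bypass_environment`, which applies urllib's matching rules; gRPC's own no_proxy / no_grpc_proxy handling may differ in details, so treat the result as a hint only. The host names and the no_proxy value are made-up examples.

```python
from urllib.request import proxy_bypass_environment


def bypasses_proxy(host, no_proxy):
    """Return True if urllib would skip the proxy for host.

    no_proxy is a comma-separated list in the usual no_proxy format.
    """
    return bool(proxy_bypass_environment(host, proxies={"no": no_proxy}))


if __name__ == "__main__":
    # Example value only; substitute the no_proxy setting from the pod.
    no_proxy = "localhost,127.0.0.1,.svc.cluster.local"
    for host in ("127.0.0.1:10001",
                 "ray-head.ray.svc.cluster.local:10001",  # hypothetical name
                 "pypi.org"):
        print(host, bypasses_proxy(host, no_proxy))
```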
Versions / Dependencies
Ray v1.11.0
Python 3.7.12
Reproduction script
Using ray.init in a Jupyter Notebook