Hello,
I am trying to set up multi-node training code using deepspeed (0.12.2) with torch (2.0.1) and the PDSH (2.35) launcher.
I successfully installed PDSH by hand and everything works fine on a single node.
However, in a 2-node setting, the code hangs at the torch initialization step.
Digging into the details, it seems that it fails to establish inter-node communication (it times out waiting for the workers on the 2nd node to connect to the master).
I assume this is an NCCL/InfiniBand issue, probably misconfigured on my side. However, I am having trouble validating this hypothesis, as most diagnostic tools require root.
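One check that does not need root is a plain TCP connection test from the 2nd node to the master, which would help separate a basic network or firewall problem from an NCCL/InfiniBand one. The following is only a minimal sketch: the function name `can_connect` is hypothetical, and it assumes something is already listening on the master's rendezvous port (for example, the hung job itself) while the check runs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, unreachable hosts and DNS failures.
        return False
```

Calling `can_connect(master_host, master_port)` on the 2nd node, with the master's host name from the hostfile and the port the launcher uses (often 29500, but treat that value as an assumption), returns `False` if the port is unreachable. In that case the problem is likely below NCCL, e.g. a firewall or the wrong network interface.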
Do you have any intuition as to why this is not working, or any solutions, options, or configurations I should explore? Can you see any issues in the NCCL config below, or could you confirm whether everything works on your side?
Thanks for the help!
Below are the NCCL environment variables set:
The code I am using to test this:
Command used to run:
deepspeed --hostfile [absolute path]/[jobid]-hosts [absolute path]/test.py
Sample hostfile: